1 Introduction

The following analysis is conducted on data describing the production of steel sheets and the presence of surface errors of classes 4, 14, and 15. The data used to inform this analysis was reduced from many larger original sets. Most dropped variables were removed based on the analysis in steps 1-5 of the data reduction and treatment stage, although some were removed at the instruction of others; these are noted with the variable removals. This, and all earlier analysis, was done in an attempt both to reproduce the results found by Prof. Wilhelm et al. and to improve on the previous analysis done in the original “ProtMod” file.

During these steps, almost all data files were successfully reproduced, excluding one file in step three and the original identifying index matrix. With regard to the index matrix, all results which could be reproduced were reproduced accurately, but there exist three files in the final workspace data set which were not shown to be created in the original code file. Without the information required to recreate these files, observations could be incorrectly identified in future analysis, and as such the original identifying matrix produced by Prof. Wilhelm was used throughout my analysis. Conversely, the data created in step 3 which did not match the previously produced data was included in my analysis instead of the original data file produced by Professor Wilhelm. These files differed by approx. 3,700 observations of a total of approx. 33,800.

2 Importing Libraries

rm(list=ls())
library(readxl)
library(ggplot2)
library(plyr)
library(data.table)
library(tidyverse)
library(randomForest) #for random forests
library(caret) # for CV folds and data splitting
library(GGally)
library(MASS)
library(car)
library(party)
library(partykit)
library(xtable)
library(knitr)
library(kableExtra)
library(summarytools)
library(gridExtra)

2.1 Loading Data

load("../anna_data/anna_merged_data_1.Rdata")

2.2 Treating Data

#Lists of Var type by name - Index variables, Numeric variables, and Categorical variables
var.index <- c("MAT_IDENT", "lTileID", "CoilID")
var.num <- colnames(df[,sapply(df,is.numeric)])
var.factor <- c("VORG_HAUPTAGGREGAT.x", "BPW_ERZEUGUNG", "FLAEMMGRAD_IST", "TAUCHAUSGUSS")

#Reordering by index variables
#all_of() marks var.index as a character vector of column names
df <- df %>%
  dplyr::select(all_of(var.index), everything())

#Variable Reduction
#Dropping sf, RIEGELLAENGE.max.slab, & lTile_length
df <- df %>%
  dplyr::select(-c(sf, RIEGELLAENGE.max.slab, lTile_length))
#Send to long format; review missing patterns
df_long <- df %>%
  tidyr::gather(key=length_attr, value=measurement, -c(MAT_IDENT, lTileID, CoilID))

2.2.0.1 Computing Descriptives

Computed below, by variable, is the number of observations (“count”), the number of unique observations (“unique”), the number of missing values (“na”), and the number of non-missing entries (“N”).
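The code that builds this summary is not shown in the report. The following is a minimal sketch, under stated assumptions, of how the four columns could be derived from a long-format table. The objects `toy_long` and `toy_desc` are hypothetical stand-ins, and counting NA as a distinct value in "unique" is an assumption chosen to be consistent with the reported table, not a confirmed detail of the original code.

```r
# Toy stand-in for df_long: one row per (variable, value) pair.
# All object names here are hypothetical and exist only for this illustration.
toy_long <- data.frame(
  length_attr = c("a", "a", "a", "a", "b", "b", "b", "b"),
  measurement = c(1, 1, NA, 2, 5, 5, 5, 5),
  stringsAsFactors = FALSE
)

# Split the measurements by variable and summarise each group.
by_var <- split(toy_long$measurement, toy_long$length_attr)
toy_desc <- data.frame(
  length_attr = names(by_var),
  count  = vapply(by_var, length, integer(1)),                          # all rows
  unique = vapply(by_var, function(v) length(unique(v)), integer(1)),   # NA counts as a value (assumption)
  na     = vapply(by_var, function(v) sum(is.na(v)), integer(1)),       # missing entries
  N      = vapply(by_var, function(v) sum(!is.na(v)), integer(1)),      # non-missing entries
  row.names = NULL,
  stringsAsFactors = FALSE
)
toy_desc
```

By construction, count = na + N for every variable, which matches the table that follows (for example 10513 + 16369 = 26882).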

Descriptives - Review for Constant Variables
length_attr count unique na N
ANST__VS_HG_3__IR__S 26882 621 10513 16369
ANST__VS_SP_3__IR__S 26882 543 10513 16369
ARGON_DRUCK_ST 26882 426 1718 25164
ARGON_DURCHFL_DUSCH 26882 3146 1718 25164
ARGON_DURCHFL_ST 26882 897 1718 25164
CHARGEN_NR 26882 249 0 26882
Class.14 26882 17 0 26882
Class.15 26882 20 0 26882
Class.4 26882 27 0 26882
DICKE__AL__IR__S 26882 16028 10513 16369
DICKE__FB__IR__S 26882 16028 10513 16369
DICKE__HA_1__IR__S 26882 16086 10513 16369
DICKE__HA_2__IR__S 26882 15569 10513 16369
DICKE__VB__IR__S 26882 596 10513 16369
DT_FS 26882 1860 1718 25164
DT_LS 26882 1952 1718 25164
DT_SSL 26882 1753 1718 25164
DT_SSR 26882 1613 1718 25164
ENTZ__FS_ZW_F1__IR__S 26882 89 10513 16369
ENTZ__FS_ZW_F2__IR__S 26882 96 10513 16369
ENTZ__ZW_OF_AL__IR__S 26882 4 10513 16369
ENTZ__ZW2_AL__IR__S 26882 61 10513 16369
ENTZ__ZWR1_AL_SN2__IR__S 26882 23 10513 16369
ENTZ__ZWR1_EL_SN1__IR__S 26882 25 10513 16369
ENTZ__ZWR1_EL_SN3__IR__S 26882 2 10513 16369
FUELLSTAND 26882 97 1718 25164
FUELLSTAND_VHUZ 26882 97 1718 25164
KEIL25__FB__IR__S 26882 15190 10513 16369
KEIL40__FB__IR__S 26882 15755 10513 16369
KEIL50__FB__IR__S 26882 15190 10513 16369
KONI_LINKS 26882 174 1718 25164
KONI_RECHTS 26882 181 1718 25164
Length.max.slab 26882 28 4034 22848
NETTO_PFANNENINHALT 26882 11689 1718 25164
PLATTENDICKE_SSL 26882 19 1718 25164
PLATTENDICKE_SSR 26882 19 1718 25164
POSITION_X.x 26882 25118 1718 25164
POSITION_X.y 26882 17482 4034 22848
PR_40__FB__IR__S 26882 15842 10513 16369
RIEGELLAENGE 26882 14346 1718 25164
RISS__HA_AS__IR__S 26882 4 10513 16369
RISS__HA_BS__IR__S 26882 9 10513 16369
STOPFENSTELLUNG 26882 262 1718 25164
STRANGBREITE 26882 118 1718 25164
STRANGNUMMER 26882 3 1718 25164
TEMP__FB__IR__S 26882 16109 10513 16369
TEMP__FB_1__IR__S 26882 16109 10513 16369
TEMP__FB_2__IR__S 26882 16109 10513 16369
TEMP__FB_3__IR__S 26882 16111 10513 16369
TEMP__HA__IR__S 26882 16109 10513 16369
TEMP__HA__OS__IR__S 26882 16112 10513 16369
TEMP__HA__SR__MAX 26882 127 10513 16369
TEMP__HA__SR__MIN 26882 128 10513 16369
TEMP__HA__SR__S 26882 127 10513 16369
TEMP__HA_1__IR__S 26882 16112 10513 16369
TEMP__HA_2__IR__S 26882 16112 10513 16369
TEMP__HA_4__IR__S 26882 16118 10513 16369
TEMP__HA_5__IR__S 26882 16099 10513 16369
TEMP__VB__IR__S 26882 632 10513 16369
TEMP__VB_1__IR__S 26882 632 10513 16369
TEMP__VB_2__IR__S 26882 632 10513 16369
TEMP__VB_3__IR__S 26882 632 10513 16369
TEMP__VB_4__IR__S 26882 632 10513 16369
TEMP__VB_5__IR__S 26882 651 10513 16369
TM_FS_M 26882 3979 1718 25164
TM_FS_SSL 26882 3763 1718 25164
TM_FS_SSR 26882 3874 1718 25164
TM_LS_M 26882 3614 1718 25164
TM_LS_SSL 26882 3636 1718 25164
TM_LS_SSR 26882 3879 1718 25164
TM_SSL_FS 26882 4216 1718 25164
TM_SSL_LS 26882 3632 1718 25164
TM_SSR_FS 26882 3894 1718 25164
TM_SSR_LS 26882 3471 1718 25164
TO_FS_M 26882 4911 1718 25164
TO_FS_SSL 26882 5124 2620 24262
TO_FS_SSR 26882 4557 1718 25164
TO_LS_M 26882 4319 1718 25164
TO_LS_SSL 26882 4276 1718 25164
TO_LS_SSR 26882 4372 1718 25164
TO_SSL_FS 26882 4991 1718 25164
TO_SSL_LS 26882 4879 1718 25164
TO_SSR_FS 26882 5008 1718 25164
TO_SSR_LS 26882 4851 1718 25164
TU_FS_M 26882 2938 2298 24584
TU_FS_SSL 26882 3048 1718 25164
TU_FS_SSR 26882 2948 1718 25164
TU_LS_M 26882 2815 1959 24923
TU_LS_SSL 26882 2945 2736 24146
TU_LS_SSR 26882 2976 1959 24923
TU_SSL_FS 26882 3365 1718 25164
TU_SSL_LS 26882 3206 1718 25164
TU_SSR_FS 26882 3205 1718 25164
TU_SSR_LS 26882 3189 1718 25164
TUNDISH_POSITION 26882 32 1718 25164
V__FS_G1__IR__S 26882 1296 10513 16369
V__FS_G2__IR__S 26882 2356 10513 16369
V__FS_G3__IR__S 26882 4205 10513 16369
V__FS_G4__IR__S 26882 6762 10513 16369
V__FS_G5__IR__S 26882 10559 10513 16369
V__FS_G6__IR__S 26882 13621 10513 16369
V__FS_G7__IR__S 26882 16100 10513 16369
VERTEILERFUELLSTAND 26882 677 1718 25164
VG 26882 571 1718 25164
VORBRAMME 26882 22 0 26882
VORG_HAUPTAGGREGAT 26882 3 1718 25164
WASSER_FS 26882 1091 1718 25164
WASSER_LS 26882 859 1718 25164
WASSER_SSL 26882 302 1718 25164
WASSER_SSR 26882 454 1718 25164
WK__FS_G1__IR__S 26882 1296 10513 16369
WK__FS_G2__IR__S 26882 2356 10513 16369
WK__FS_G3__IR__S 26882 4205 10513 16369
WK__FS_G4__IR__S 26882 6762 10513 16369
WK__FS_G5__IR__S 26882 10559 10513 16369
WK__FS_G6__IR__S 26882 13621 10513 16369
WK__FS_G7__IR__S 26882 16169 10513 16369
WK__VS_HG_3__IR__S 26882 621 10513 16369
WK__VS_SP_3__IR__S 26882 396 10513 16369
WSPALT__FS_G1__IR__S 26882 1296 10513 16369
WSPALT__FS_G2__IR__S 26882 2356 10513 16369
WSPALT__FS_G3__IR__S 26882 4211 10513 16369
WSPALT__FS_G4__IR__S 26882 6780 10513 16369
WSPALT__FS_G5__IR__S 26882 10604 10513 16369
WSPALT__FS_G6__IR__S 26882 13687 10513 16369
WSPALT__FS_G7__IR__S 26882 16169 10513 16369

With the above table we can review the number of missing values in each variable, whether any patterns arise from the missing values, and whether there are any constant variables (variables without variation). We drop all constant variables, since a variable that does not vary cannot help explain the variation in other variables.

#Drop constant vars
var.all.const <- df_desc %>%
  dplyr::filter(unique == 1) %>%
  dplyr::select(length_attr) %>%
  unlist() %>%
  as.character()

length(var.all.const)
## [1] 0

When reviewing var.all.const it becomes clear that this check finds no constant variables, so there is nothing here to drop.
With regard to the missing values, the table shows a rather obvious pattern: almost all variables have either 10513, 1718, or no missing values, although certain variables, listed below, do not follow this pattern. As our data is identified across multiple levels of specificity (i.e. slab, tile, etc.), this suggests that certain groups of variables are collected only within certain scopes. Since some variables may be collected concurrently, variables should be reviewed for correlation.

  • Variables not following an above-referenced missing pattern:
    • 4034: POSITION_X.y, Length.max.slab
    • 2736: TU_LS_SSL
    • 2620: TO_FS_SSL
    • 2298: TU_FS_M
    • 1959: TU_LS_M, TU_LS_SSR
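The grouping above can be read off the descriptives table by hand, but it can also be computed. The following is a small sketch, assuming a table with one row per variable and an `na` column. The data frame `toy_desc` is a hypothetical excerpt whose values are copied from the table above, and the vector `common` simply encodes the three dominant counts named in the text.

```r
# Hypothetical excerpt of the descriptives table (values copied from the table above).
toy_desc <- data.frame(
  length_attr = c("DT_FS", "DT_LS", "TEMP__VB__IR__S", "Class.4",
                  "TU_LS_M", "TU_LS_SSR", "TO_FS_SSL"),
  na = c(1718, 1718, 10513, 0, 1959, 1959, 2620),
  stringsAsFactors = FALSE
)

# How many variables share each missing-value count?
na_pattern <- table(toy_desc$na)

# Variable names listed under each missing-value count.
vars_by_na <- split(toy_desc$length_attr, toy_desc$na)

# Keep only the counts outside the three dominant patterns named in the text.
common <- c(0, 1718, 10513)
unusual <- vars_by_na[!names(vars_by_na) %in% as.character(common)]
unusual
```

Variables that share an unusual count, such as TU_LS_M and TU_LS_SSR here, are natural candidates for having been recorded together.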

Next, we separate the variables by data type: integer (int), categorical (factor), and numeric. The data frame is then split on those types.

#var lists by type
var.list <- colnames(df) #names of all variables
var.int <- colnames(df[,sapply(df,is.integer)]) #integer variables
var.factor <- colnames(df[,sapply(df,is.factor)]) #categorical vars
var.num <- colnames(df[,sapply(df,is.numeric)]) #numeric vars

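#splitting DF by type
#all_of() marks each argument as a character vector of column names
df_factor <- df %>%
  dplyr::select(all_of(var.index), all_of(var.factor))
#is.numeric() is also TRUE for integer columns, so they are removed explicitly
df_num <- df %>%
  dplyr::select(all_of(var.num), -all_of(var.int))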
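The reason integer columns are subtracted from the numeric list is that `is.numeric()` returns TRUE for both integer and double columns. A minimal illustration on a toy data frame follows; the column names are hypothetical and not taken from the data set.

```r
# Toy data frame; column names are hypothetical.
toy <- data.frame(
  id    = c(1L, 2L, 3L),             # integer
  temp  = c(20.5, 21.0, 19.8),       # double
  grade = factor(c("a", "b", "a"))   # factor
)

is_num <- sapply(toy, is.numeric)    # TRUE for integer and double columns
is_int <- sapply(toy, is.integer)    # TRUE for integer columns only

names(toy)[is_num]             # "id" "temp"
names(toy)[is_num & !is_int]   # "temp"
```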

2.3 Exploring Numeric Variables

df_num_long <- df_num %>%
  gather(key=slab_attr, value=measurement)
#descriptives here are grouped over the entire df

2.3.1 Computing Descriptives of Numeric Variables

Numeric Variable Descriptives
slab_attr count mean sd min max unique na N max_freq min_freq freq_ratio
ANST__VS_HG_3__IR__S 26882 4.564310e-02 2.360949e-01 0.000000e+00 2.048611e+00 621 10513 16369 15750 1 25.3607085
ANST__VS_SP_3__IR__S 26882 1.861879e+00 9.709546e+00 0.000000e+00 8.524173e+01 543 10513 16369 15760 1 29.0220994
ARGON_DRUCK_ST 26882 5.986625e+01 1.870098e+01 1.800000e+01 1.000000e+02 426 1718 25164 1076 1 2.5234742
ARGON_DURCHFL_DUSCH 26882 1.152141e+02 3.499762e+01 5.200000e-01 1.900000e+02 4033 1718 25164 2822 1 0.6994793
ARGON_DURCHFL_ST 26882 8.326040e+00 5.982072e-01 3.410000e+00 1.108000e+01 1050 1718 25164 1952 1 1.8580952
CHARGEN_NR 26882 4.490450e+05 2.574589e+05 1.631710e+05 7.226910e+05 249 0 26882 1134 1 4.5502008
Class.14 26882 2.823450e-02 4.637674e-01 0.000000e+00 3.700000e+01 17 0 26882 26664 1 1568.4117647
Class.15 26882 2.652330e-02 4.595173e-01 0.000000e+00 2.800000e+01 20 0 26882 26684 1 1334.1500000
Class.4 26882 4.746671e-01 1.265641e+00 0.000000e+00 3.500000e+01 27 0 26882 20569 1 761.7777778
CoilID 26882 1.914100e+07 3.678003e+05 1.865870e+07 2.002710e+07 657 0 26882 321 1 0.4870624
DICKE__AL__IR__S 26882 1.511021e-01 4.236360e-02 0.000000e+00 6.130659e-01 16028 10513 16369 343 1 0.0213377
DICKE__FB__IR__S 26882 1.511021e-01 4.236360e-02 0.000000e+00 6.130659e-01 16028 10513 16369 343 1 0.0213377
DICKE__HA_1__IR__S 26882 1.516607e-01 4.182080e-02 0.000000e+00 6.130659e-01 16086 10513 16369 285 1 0.0176551
DICKE__HA_2__IR__S 26882 1.459174e-01 4.722860e-02 0.000000e+00 4.192426e-01 15569 10513 16369 802 1 0.0514484
DICKE__VB__IR__S 26882 4.809820e-02 2.543457e-01 0.000000e+00 2.269721e+00 596 10513 16369 15775 1 26.4664430
DT_FS 26882 6.856955e+01 4.827426e+00 5.070000e+01 8.510000e+01 2260 1718 25164 533 1 0.2353982
DT_LS 26882 7.040780e+01 5.088746e+00 5.050000e+01 8.580000e+01 2415 1718 25164 459 1 0.1896480
DT_SSL 26882 5.877077e+01 4.419236e+00 4.358000e+01 6.970000e+01 2175 1718 25164 421 1 0.1931034
DT_SSR 26882 5.354157e+01 3.564115e+00 4.070000e+01 6.600000e+01 2055 1718 25164 486 1 0.2360097
ENTZ__FS_ZW_F1__IR__S 26882 3.392100e-03 1.191530e-02 0.000000e+00 8.333330e-02 89 10513 16369 15071 1 169.3258427
ENTZ__FS_ZW_F2__IR__S 26882 5.956500e-03 1.562020e-02 0.000000e+00 8.333330e-02 96 10513 16369 14176 1 147.6562500
ENTZ__ZW_OF_AL__IR__S 26882 3.000000e-06 2.196000e-04 0.000000e+00 1.639340e-02 4 10513 16369 16366 1 4091.2500000
ENTZ__ZW2_AL__IR__S 26882 1.527500e-03 7.791700e-03 0.000000e+00 6.666670e-02 61 10513 16369 15736 1 257.9508197
ENTZ__ZWR1_AL_SN2__IR__S 26882 1.111000e-04 2.261600e-03 0.000000e+00 6.666670e-02 23 10513 16369 16328 1 709.8695652
ENTZ__ZWR1_EL_SN1__IR__S 26882 1.006000e-04 2.112400e-03 0.000000e+00 6.666670e-02 25 10513 16369 16330 1 653.1600000
ENTZ__ZWR1_EL_SN3__IR__S 26882 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2 10513 16369 16369 16369 0.0000000
FUELLSTAND 26882 7.497974e+01 6.737898e-01 7.200000e+01 7.850000e+01 97 1718 25164 12711 1 131.0309278
FUELLSTAND_VHUZ 26882 7.497974e+01 6.737898e-01 7.200000e+01 7.850000e+01 97 1718 25164 12711 1 131.0309278
KEIL25__FB__IR__S 26882 -2.246960e-02 8.339448e-01 -4.823868e+00 3.480015e+00 15190 10513 16369 1181 1 0.0776827
KEIL40__FB__IR__S 26882 -1.268257e-01 7.662683e-01 -4.087126e+00 3.306016e+00 15755 10513 16369 616 1 0.0390352
KEIL50__FB__IR__S 26882 -4.375930e-02 8.002416e-01 -4.844529e+00 3.428814e+00 15190 10513 16369 1181 1 0.0776827
KONI_LINKS 26882 1.119920e+01 6.437812e-01 3.700000e+00 1.867500e+01 204 1718 25164 10203 1 50.0098039
KONI_RECHTS 26882 1.120665e+01 6.602210e-01 3.700000e+00 2.070000e+01 221 1718 25164 8391 1 37.9638009
Length.max.slab 26882 1.265407e+03 5.041442e+01 5.280000e+02 1.312000e+03 30 4034 22848 9384 3 312.7000000
lTileID 26882 2.433464e+02 1.369660e+02 1.000000e+00 5.100000e+02 511 7 26875 113 1 0.2191781
MAT_IDENT 26882 2.985188e+07 3.937609e+05 2.917029e+07 3.077100e+07 657 0 26882 321 1 0.4870624
NETTO_PFANNENINHALT 26882 1.648394e+02 7.692643e+01 0.000000e+00 4.040000e+02 12751 1718 25164 14 1 0.0010195
PLATTENDICKE_SSL 26882 4.829999e+01 2.019073e+00 4.320000e+01 5.000000e+01 40 1718 25164 8017 15 200.0500000
PLATTENDICKE_SSR 26882 4.745850e+01 2.089287e+00 4.329000e+01 5.000000e+01 35 1718 25164 5988 3 171.0000000
POSITION_X.x 26882 5.979768e+02 3.447785e+02 5.175509e+00 1.841143e+03 25133 1718 25164 2 1 0.0000398
POSITION_X.y 26882 5.420162e+02 3.192708e+02 3.703906e+00 1.289000e+03 17499 4034 22848 72 1 0.0040574
PR_40__FB__IR__S 26882 1.800993e+00 7.283150e-01 0.000000e+00 8.111780e+00 15842 10513 16369 529 1 0.0333291
RIEGELLAENGE 26882 5.140679e+00 2.881748e+00 5.400000e-02 1.158800e+01 17836 1718 25164 12 1 0.0006167
RISS__HA_AS__IR__S 26882 3.270000e-05 3.355500e-03 0.000000e+00 4.107143e-01 4 10513 16369 16367 1 4091.5000000
RISS__HA_BS__IR__S 26882 1.509000e-04 8.996600e-03 0.000000e+00 7.083333e-01 9 10513 16369 16362 1 1817.8888889
STOPFENSTELLUNG 26882 5.489166e+01 5.534024e+00 4.300000e+01 7.000000e+01 262 1718 25164 2270 1 8.6603053
STRANGBREITE 26882 2.490284e+03 1.190497e+02 2.151000e+03 2.577000e+03 118 1718 25164 9198 1 77.9406780
STRANGNUMMER 26882 1.460777e+00 4.984691e-01 1.000000e+00 2.000000e+00 3 1718 25164 13569 11595 658.0000000
TEMP__FB__IR__S 26882 4.497819e+01 1.231432e+01 0.000000e+00 1.956925e+02 16109 10513 16369 262 1 0.0162021
TEMP__FB_1__IR__S 26882 4.488583e+01 1.228837e+01 0.000000e+00 1.951373e+02 16109 10513 16369 262 1 0.0162021
TEMP__FB_2__IR__S 26882 4.501117e+01 1.232453e+01 0.000000e+00 1.956925e+02 16109 10513 16369 262 1 0.0162021
TEMP__FB_3__IR__S 26882 4.485969e+01 1.367100e+01 0.000000e+00 4.520970e+02 16111 10513 16369 260 1 0.0160760
TEMP__HA__IR__S 26882 3.136023e+01 8.657163e+00 0.000000e+00 1.549556e+02 16109 10513 16369 262 1 0.0162021
TEMP__HA__OS__IR__S 26882 3.127105e+01 8.688989e+00 0.000000e+00 1.550227e+02 16112 10513 16369 259 1 0.0160129
TEMP__HA__SR__MAX 26882 3.561846e+01 4.902349e+01 0.000000e+00 6.400000e+02 127 10513 16369 831 1 6.5354331
TEMP__HA__SR__MIN 26882 3.339235e+01 4.595951e+01 0.000000e+00 6.000000e+02 128 10513 16369 831 1 6.4843750
TEMP__HA__SR__S 26882 3.450541e+01 4.749150e+01 0.000000e+00 6.200000e+02 127 10513 16369 831 1 6.5354331
TEMP__HA_1__IR__S 26882 3.137055e+01 8.701447e+00 0.000000e+00 1.549556e+02 16112 10513 16369 259 1 0.0160129
TEMP__HA_2__IR__S 26882 3.127105e+01 8.688989e+00 0.000000e+00 1.550227e+02 16112 10513 16369 259 1 0.0160129
TEMP__HA_4__IR__S 26882 2.891099e+01 7.921209e+00 0.000000e+00 1.079578e+02 16118 10513 16369 253 1 0.0156347
TEMP__HA_5__IR__S 26882 3.086202e+01 8.576830e+00 0.000000e+00 1.451340e+02 16099 10513 16369 272 1 0.0168333
TEMP__VB__IR__S 26882 1.603233e+00 8.225923e+00 0.000000e+00 7.206003e+01 632 10513 16369 15739 1 24.9018987
TEMP__VB_1__IR__S 26882 1.602526e+00 8.220138e+00 0.000000e+00 7.206003e+01 632 10513 16369 15739 1 24.9018987
TEMP__VB_2__IR__S 26882 1.601913e+00 8.216825e+00 0.000000e+00 7.172220e+01 632 10513 16369 15739 1 24.9018987
TEMP__VB_3__IR__S 26882 1.610312e+00 8.259578e+00 0.000000e+00 7.185537e+01 632 10513 16369 15739 1 24.9018987
TEMP__VB_4__IR__S 26882 1.611833e+00 8.267230e+00 0.000000e+00 7.199982e+01 632 10513 16369 15739 1 24.9018987
TEMP__VB_5__IR__S 26882 1.699921e+00 8.578367e+00 0.000000e+00 7.287423e+01 651 10513 16369 15720 1 24.1459293
TM_FS_M 26882 1.288398e+02 1.178362e+01 9.320000e+01 1.566000e+02 5450 1718 25164 62 1 0.0111927
TM_FS_SSL 26882 1.277399e+02 1.056593e+01 9.333333e+01 1.597000e+02 5226 1718 25164 56 1 0.0105243
TM_FS_SSR 26882 1.285015e+02 1.120327e+01 9.915000e+01 1.614500e+02 5206 1718 25164 71 1 0.0134460
TM_LS_M 26882 1.265216e+02 1.003511e+01 8.960000e+01 1.510000e+02 4937 1718 25164 73 1 0.0145838
TM_LS_SSL 26882 1.318702e+02 1.032792e+01 1.014000e+02 1.628667e+02 4988 1718 25164 75 1 0.0148356
TM_LS_SSR 26882 1.321628e+02 1.085231e+01 9.990000e+01 1.691000e+02 5321 1718 25164 73 1 0.0135313
TM_SSL_FS 26882 1.453342e+02 1.446186e+01 1.041750e+02 1.791000e+02 5156 1718 25164 77 1 0.0147401
TM_SSL_LS 26882 1.329567e+02 1.021378e+01 1.073667e+02 1.645000e+02 4517 1718 25164 109 1 0.0239097
TM_SSR_FS 26882 1.376102e+02 1.083885e+01 1.031500e+02 1.767000e+02 4857 1718 25164 79 1 0.0160593
TM_SSR_LS 26882 1.301028e+02 8.983733e+00 1.016667e+02 1.598000e+02 4312 1718 25164 94 1 0.0215677
TO_FS_M 26882 1.846690e+02 1.613626e+01 1.339333e+02 2.244500e+02 6683 1718 25164 47 1 0.0068831
TO_FS_SSL 26882 1.885192e+02 2.793312e+01 8.000000e-01 7.977000e+02 6835 2620 24262 51 1 0.0073153
TO_FS_SSR 26882 1.913317e+02 1.379441e+01 1.441750e+02 2.262500e+02 6299 1718 25164 58 1 0.0090491
TO_LS_M 26882 1.863830e+02 1.294765e+01 1.388000e+02 2.216500e+02 5884 1718 25164 69 1 0.0115568
TO_LS_SSL 26882 1.961198e+02 1.250285e+01 1.511500e+02 2.289000e+02 5864 1718 25164 63 1 0.0105730
TO_LS_SSR 26882 1.949970e+02 1.264377e+01 1.455000e+02 2.248000e+02 5991 1718 25164 65 1 0.0106827
TO_SSL_FS 26882 2.052935e+02 1.576007e+01 1.476000e+02 2.410000e+02 6657 1718 25164 53 1 0.0078113
TO_SSL_LS 26882 2.015116e+02 1.541129e+01 1.503667e+02 2.377000e+02 6521 1718 25164 50 1 0.0075142
TO_SSR_FS 26882 2.030546e+02 1.587309e+01 1.448667e+02 2.412000e+02 6750 1718 25164 57 1 0.0082963
TO_SSR_LS 26882 1.983856e+02 1.474387e+01 1.408250e+02 2.367500e+02 6523 1718 25164 56 1 0.0084317
TU_FS_M 26882 1.116696e+02 7.413646e+00 8.015000e+01 1.304500e+02 4243 2298 24584 92 1 0.0214471
TU_FS_SSL 26882 1.087951e+02 7.716555e+00 8.368000e+01 1.342000e+02 4368 1718 25164 91 1 0.0206044
TU_FS_SSR 26882 1.089810e+02 7.231841e+00 8.084000e+01 1.348500e+02 4249 1718 25164 126 1 0.0294187
TU_LS_M 26882 1.086232e+02 6.924098e+00 6.540000e+01 1.294500e+02 4090 1959 24923 100 1 0.0242054
TU_LS_SSL 26882 1.152493e+02 7.128589e+00 8.065000e+01 1.382500e+02 4182 2736 24146 105 1 0.0248685
TU_LS_SSR 26882 1.150699e+02 7.318471e+00 8.833333e+01 1.394000e+02 4300 1959 24923 83 1 0.0190698
TU_SSL_FS 26882 1.257709e+02 9.163527e+00 8.836667e+01 1.524000e+02 4481 1718 25164 76 1 0.0167373
TU_SSL_LS 26882 1.153944e+02 7.976129e+00 9.183333e+01 1.436000e+02 4297 1718 25164 94 1 0.0216430
TU_SSR_FS 26882 1.221326e+02 8.008096e+00 8.710000e+01 1.467000e+02 4318 1718 25164 86 1 0.0196850
TU_SSR_LS 26882 1.141591e+02 7.792329e+00 8.290000e+01 1.485000e+02 4290 1718 25164 117 1 0.0270396
TUNDISH_POSITION 26882 1.271549e+01 1.156004e+01 0.000000e+00 4.200000e+01 32 1718 25164 4115 1 128.5625000
V__FS_G1__IR__S 26882 3.216581e-01 1.135128e+00 0.000000e+00 8.171358e+00 1296 10513 16369 15075 1 11.6311728
V__FS_G2__IR__S 26882 9.235807e-01 2.339188e+00 0.000000e+00 1.247841e+01 2356 10513 16369 14015 1 5.9482173
V__FS_G3__IR__S 26882 2.603245e+00 4.623162e+00 0.000000e+00 2.758184e+01 4205 10513 16369 12166 1 2.8929845
V__FS_G4__IR__S 26882 6.294015e+00 7.883289e+00 0.000000e+00 4.301559e+01 6762 10513 16369 9609 1 1.4208814
V__FS_G5__IR__S 26882 1.486168e+01 1.202346e+01 0.000000e+00 7.304973e+01 10559 10513 16369 5812 1 0.5503362
V__FS_G6__IR__S 26882 2.531957e+01 1.339378e+01 0.000000e+00 9.872687e+01 13621 10513 16369 2750 1 0.2018207
V__FS_G7__IR__S 26882 3.673649e+01 1.082059e+01 0.000000e+00 1.611927e+02 16100 10513 16369 271 1 0.0167702
VERTEILERFUELLSTAND 26882 7.882090e+01 1.623281e+00 6.132000e+01 8.246667e+01 844 1718 25164 1457 1 1.7251185
VG 26882 9.885491e-01 9.614080e-02 7.610000e-01 1.157000e+00 819 1718 25164 2199 1 2.6837607
VORBRAMME 26882 2.670925e+02 2.481209e+02 2.200000e+01 5.530000e+02 22 0 26882 3016 9 136.6818182
WASSER_FS 26882 2.661484e+00 1.759780e-02 2.563600e+00 2.720500e+00 1765 1718 25164 529 1 0.2991501
WASSER_LS 26882 2.669746e+00 1.256150e-02 2.595750e+00 2.719500e+00 1399 1718 25164 854 1 0.6097212
WASSER_SSL 26882 2.626474e-01 8.790700e-03 2.450000e-01 2.782000e-01 352 1718 25164 2249 1 6.3863636
WASSER_SSR 26882 2.572780e-01 1.135910e-02 2.370000e-01 2.810000e-01 535 1718 25164 1587 1 2.9644860
WK__FS_G1__IR__S 26882 4.442741e+01 1.564459e+02 0.000000e+00 1.077098e+03 1296 10513 16369 15075 1 11.6311728
WK__FS_G2__IR__S 26882 8.591459e+01 2.168795e+02 0.000000e+00 1.202033e+03 2356 10513 16369 14015 1 5.9482173
WK__FS_G3__IR__S 26882 1.582005e+02 2.797363e+02 0.000000e+00 1.679224e+03 4205 10513 16369 12166 1 2.8929845
WK__FS_G4__IR__S 26882 2.469891e+02 3.074766e+02 0.000000e+00 1.692017e+03 6762 10513 16369 9609 1 1.4208814
WK__FS_G5__IR__S 26882 3.512747e+02 2.814103e+02 0.000000e+00 1.725439e+03 10559 10513 16369 5812 1 0.5503362
WK__FS_G6__IR__S 26882 3.571932e+02 1.870432e+02 0.000000e+00 1.296698e+03 13621 10513 16369 2750 1 0.2018207
WK__FS_G7__IR__S 26882 3.018373e+02 8.826540e+01 0.000000e+00 1.437853e+03 16169 10513 16369 202 1 0.0124312
WK__VS_HG_3__IR__S 26882 2.081314e+01 1.078005e+02 0.000000e+00 9.421366e+02 621 10513 16369 15750 1 25.3607085
WK__VS_SP_3__IR__S 26882 6.795180e-02 5.596544e-01 -3.138046e-01 1.225219e+01 396 10513 16369 15964 1 40.3106061
WSPALT__FS_G1__IR__S 26882 7.454170e-02 2.619365e-01 0.000000e+00 1.830767e+00 1296 10513 16369 15075 1 11.6311728
WSPALT__FS_G2__IR__S 26882 9.341970e-02 2.353754e-01 0.000000e+00 1.252877e+00 2356 10513 16369 14015 1 5.9482173
WSPALT__FS_G3__IR__S 26882 1.157892e-01 2.043523e-01 0.000000e+00 1.208468e+00 4211 10513 16369 12160 1 2.8874377
WSPALT__FS_G4__IR__S 26882 1.295598e-01 1.611171e-01 0.000000e+00 8.493480e-01 6780 10513 16369 9591 1 1.4144543
WSPALT__FS_G5__IR__S 26882 1.390093e-01 1.109823e-01 0.000000e+00 6.135501e-01 10604 10513 16369 5767 1 0.5437571
WSPALT__FS_G6__IR__S 26882 1.355501e-01 7.061980e-02 0.000000e+00 5.206122e-01 13687 10513 16369 2684 1 0.1960254
WSPALT__FS_G7__IR__S 26882 1.647887e-01 4.480130e-02 0.000000e+00 6.212929e-01 16169 10513 16369 202 1 0.0124312
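The report does not define the max_freq, min_freq, and freq_ratio columns. The reported values are consistent with max_freq and min_freq being the counts of the most and least frequent non-missing value, and with freq_ratio = (max_freq - min_freq) / unique. For STRANGNUMMER, for instance, 13569 + 11595 = 25164 = N and (13569 - 11595) / 3 = 658. This reading is inferred from the numbers alone, so the sketch below is an assumption rather than the original computation; `freq_summary` is a hypothetical helper.

```r
# Hypothetical helper; the definitions are inferred from the table,
# not taken from the original code.
freq_summary <- function(v) {
  tab <- table(v)                  # counts per non-missing value (NA is excluded by default)
  n_unique <- length(unique(v))    # NA counts as one distinct value here
  c(max_freq   = max(tab),
    min_freq   = min(tab),
    unique     = n_unique,
    freq_ratio = (max(tab) - min(tab)) / n_unique)
}

# Toy vector shaped like STRANGNUMMER: two strand numbers plus missing entries.
strand <- c(rep(1, 6), rep(2, 4), NA, NA)
freq_summary(strand)
```

Under this reading, a large freq_ratio flags a variable dominated by a single value, such as the ENTZ and RISS variables in the table.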

Again, we are looking for variables without variation and any interesting patterns within the data. Whereas above a variable was judged constant by its number of unique observations, here variables are checked for a standard deviation of 0.
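The two checks can disagree when a variable has missing values. The short sketch below assumes that the unique count treats NA as a distinct value and that the standard deviation is computed on the observed values only. Both assumptions are consistent with the tables in this report but are not confirmed by any code shown in it.

```r
# Toy column: constant wherever it is observed, but partly missing
# (the same shape as ENTZ__ZWR1_EL_SN3__IR__S in the table above).
x <- c(0, 0, 0, NA, NA)

n_unique_with_na <- length(unique(x))   # NA is counted as a distinct value
sd_observed <- sd(x, na.rm = TRUE)      # spread of the observed values only

n_unique_with_na == 1   # FALSE: a "unique == 1" filter does not flag x
sd_observed == 0        # TRUE: an "sd == 0" filter does
```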

#Dropping variables with no variation
var.sd0 <- df_num_desc %>%
  dplyr::filter(sd==0) %>%
  dplyr::select(slab_attr) %>%
  unlist() %>%
  as.character()

df_num_long <- df_num_long %>%
  dplyr::filter(!slab_attr %in% var.sd0)

df_num <- df_num %>%
  dplyr::select(-all_of(var.sd0))

From the table above we see that there is only one variable with a standard deviation of zero — "ENTZ__ZWR1_EL_SN3__IR__S". This is also the only variable selected and dropped in the above code chunk. We now calculate the pairwise correlation of all our numeric variables.

2.3.2 Correlation Matrix of Numeric Variables

## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
[Figure: pairwise correlation matrix of the numeric variables]

The plot above shows the pairwise correlation of all variables, with positive correlation marked in green and negative correlation marked in red. Additionally, variables are arranged such that groups of correlated variables are listed together. For all pairs which are perfectly correlated (i.e. having a correlation of 1 or -1), only one member of the pair is kept in the final data set. When variables are perfectly correlated, the information they provide in describing the variability of other variables is redundant. The latter member of each pair is dropped to account for this redundancy.

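#For all pairs which are perfectly correlated (|r| = 1), we keep only one
corONE <- function(x) {
    if (!is.matrix(x)) {
      stop("corONE() expects a correlation matrix")
    }
    #row/col indices of every cell with |r| == 1; row names go to column "rn"
    cor1.df <- data.frame(which(abs(x)==1, arr.ind=TRUE))
    setDT(cor1.df, keep.rownames = TRUE)
    #below the diagonal only, so the later variable of each pair is listed
    cor1.list <- cor1.df$rn[cor1.df$row > cor1.df$col]
    #remove entries whose name contains a "."; such variables are never dropped
    duplicate.list <- grepl(glob2rx("*.*"), cor1.list)
    #return the names to drop
    cor1.list[!duplicate.list]
}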

#list of all totally correlated variables
cor1.list <- corONE(cormat)
write.table(cor1.list, file="anna_Length_ListofVariableswithCor1.txt", sep="\t")

#Drop corresponding columns
df_num <- df_num %>%
  dplyr::select(-all_of(cor1.list))
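The construction of cormat is not shown in the report. As a rough sketch, a matrix of this kind could be computed with base R's `cor()` using the method and missing-value handling printed above (Pearson, pairwise complete observations), and the perfectly correlated pairs located in its lower triangle. The toy data and all object names below are hypothetical. Unlike corONE(), which compares against 1 exactly, this sketch uses a small tolerance, since computed correlations can differ from 1 by floating-point error.

```r
# Toy data; column names are hypothetical.
toy <- data.frame(a = c(1, 2, 3, 4, NA, 6))
toy$b <- toy$a            # exact copy of a
toy$c <- -2 * toy$a       # exact negative multiple of a
toy$d <- c(3, 1, 4, 1, 5, 9)

# Pearson correlations, using all complete observations for each pair of columns.
toy_cormat <- cor(toy, use = "pairwise.complete.obs", method = "pearson")

# Compare |r| to 1 with a tolerance rather than exactly.
is_one <- abs(abs(toy_cormat) - 1) < 1e-12

# Look below the diagonal only, so each pair is seen once
# and the later variable of the pair is the one listed.
is_one[upper.tri(is_one, diag = TRUE)] <- FALSE
to_drop <- unique(rownames(is_one)[which(is_one, arr.ind = TRUE)[, "row"]])
to_drop
```

Here b and c are both perfectly correlated with a, so both are listed for removal while a and d are kept.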

2.3.3 Computing New Descriptives for Numeric Variables

Numeric Variable Descriptives - Updated
slab_attr count mean sd min max unique na N
ANST__VS_HG_3__IR__S 26882 4.564310e-02 2.360949e-01 0.000000e+00 2.048611e+00 621 10513 16369
ANST__VS_SP_3__IR__S 26882 1.861879e+00 9.709546e+00 0.000000e+00 8.524173e+01 543 10513 16369
ARGON_DRUCK_ST 26882 5.986625e+01 1.870098e+01 1.800000e+01 1.000000e+02 426 1718 25164
ARGON_DURCHFL_DUSCH 26882 1.152141e+02 3.499762e+01 5.200000e-01 1.900000e+02 4033 1718 25164
ARGON_DURCHFL_ST 26882 8.326040e+00 5.982072e-01 3.410000e+00 1.108000e+01 1050 1718 25164
CHARGEN_NR 26882 4.490450e+05 2.574589e+05 1.631710e+05 7.226910e+05 249 0 26882
Class.14 26882 2.823450e-02 4.637674e-01 0.000000e+00 3.700000e+01 17 0 26882
Class.15 26882 2.652330e-02 4.595173e-01 0.000000e+00 2.800000e+01 20 0 26882
Class.4 26882 4.746671e-01 1.265641e+00 0.000000e+00 3.500000e+01 27 0 26882
CoilID 26882 1.914100e+07 3.678003e+05 1.865870e+07 2.002710e+07 657 0 26882
DICKE__AL__IR__S 26882 1.511021e-01 4.236360e-02 0.000000e+00 6.130659e-01 16028 10513 16369
DICKE__HA_1__IR__S 26882 1.516607e-01 4.182080e-02 0.000000e+00 6.130659e-01 16086 10513 16369
DICKE__HA_2__IR__S 26882 1.459174e-01 4.722860e-02 0.000000e+00 4.192426e-01 15569 10513 16369
DICKE__VB__IR__S 26882 4.809820e-02 2.543457e-01 0.000000e+00 2.269721e+00 596 10513 16369
DT_FS 26882 6.856955e+01 4.827426e+00 5.070000e+01 8.510000e+01 2260 1718 25164
DT_LS 26882 7.040780e+01 5.088746e+00 5.050000e+01 8.580000e+01 2415 1718 25164
DT_SSL 26882 5.877077e+01 4.419236e+00 4.358000e+01 6.970000e+01 2175 1718 25164
DT_SSR 26882 5.354157e+01 3.564115e+00 4.070000e+01 6.600000e+01 2055 1718 25164
ENTZ__FS_ZW_F1__IR__S 26882 3.392100e-03 1.191530e-02 0.000000e+00 8.333330e-02 89 10513 16369
ENTZ__FS_ZW_F2__IR__S 26882 5.956500e-03 1.562020e-02 0.000000e+00 8.333330e-02 96 10513 16369
ENTZ__ZW_OF_AL__IR__S 26882 3.000000e-06 2.196000e-04 0.000000e+00 1.639340e-02 4 10513 16369
ENTZ__ZW2_AL__IR__S 26882 1.527500e-03 7.791700e-03 0.000000e+00 6.666670e-02 61 10513 16369
ENTZ__ZWR1_AL_SN2__IR__S 26882 1.111000e-04 2.261600e-03 0.000000e+00 6.666670e-02 23 10513 16369
ENTZ__ZWR1_EL_SN1__IR__S 26882 1.006000e-04 2.112400e-03 0.000000e+00 6.666670e-02 25 10513 16369
FUELLSTAND 26882 7.497974e+01 6.737898e-01 7.200000e+01 7.850000e+01 97 1718 25164
KEIL25__FB__IR__S 26882 -2.246960e-02 8.339448e-01 -4.823868e+00 3.480015e+00 15190 10513 16369
KEIL40__FB__IR__S 26882 -1.268257e-01 7.662683e-01 -4.087126e+00 3.306016e+00 15755 10513 16369
KEIL50__FB__IR__S 26882 -4.375930e-02 8.002416e-01 -4.844529e+00 3.428814e+00 15190 10513 16369
KONI_LINKS 26882 1.119920e+01 6.437812e-01 3.700000e+00 1.867500e+01 204 1718 25164
KONI_RECHTS 26882 1.120665e+01 6.602210e-01 3.700000e+00 2.070000e+01 221 1718 25164
Length.max.slab 26882 1.265407e+03 5.041442e+01 5.280000e+02 1.312000e+03 30 4034 22848
lTileID 26882 2.433464e+02 1.369660e+02 1.000000e+00 5.100000e+02 511 7 26875
MAT_IDENT 26882 2.985188e+07 3.937609e+05 2.917029e+07 3.077100e+07 657 0 26882
NETTO_PFANNENINHALT 26882 1.648394e+02 7.692643e+01 0.000000e+00 4.040000e+02 12751 1718 25164
PLATTENDICKE_SSL 26882 4.829999e+01 2.019073e+00 4.320000e+01 5.000000e+01 40 1718 25164
PLATTENDICKE_SSR 26882 4.745850e+01 2.089287e+00 4.329000e+01 5.000000e+01 35 1718 25164
POSITION_X.x 26882 5.979768e+02 3.447785e+02 5.175509e+00 1.841143e+03 25133 1718 25164
POSITION_X.y 26882 5.420162e+02 3.192708e+02 3.703906e+00 1.289000e+03 17499 4034 22848
PR_40__FB__IR__S 26882 1.800993e+00 7.283150e-01 0.000000e+00 8.111780e+00 15842 10513 16369
RIEGELLAENGE 26882 5.140679e+00 2.881748e+00 5.400000e-02 1.158800e+01 17836 1718 25164
RISS__HA_AS__IR__S 26882 3.270000e-05 3.355500e-03 0.000000e+00 4.107143e-01 4 10513 16369
RISS__HA_BS__IR__S 26882 1.509000e-04 8.996600e-03 0.000000e+00 7.083333e-01 9 10513 16369
STOPFENSTELLUNG 26882 5.489166e+01 5.534024e+00 4.300000e+01 7.000000e+01 262 1718 25164
STRANGBREITE 26882 2.490284e+03 1.190497e+02 2.151000e+03 2.577000e+03 118 1718 25164
STRANGNUMMER 26882 1.460777e+00 4.984691e-01 1.000000e+00 2.000000e+00 3 1718 25164
TEMP__FB__IR__S 26882 4.497819e+01 1.231432e+01 0.000000e+00 1.956925e+02 16109 10513 16369
TEMP__FB_1__IR__S 26882 4.488583e+01 1.228837e+01 0.000000e+00 1.951373e+02 16109 10513 16369
TEMP__FB_2__IR__S 26882 4.501117e+01 1.232453e+01 0.000000e+00 1.956925e+02 16109 10513 16369
TEMP__FB_3__IR__S 26882 4.485969e+01 1.367100e+01 0.000000e+00 4.520970e+02 16111 10513 16369
TEMP__HA__IR__S 26882 3.136023e+01 8.657163e+00 0.000000e+00 1.549556e+02 16109 10513 16369
TEMP__HA__SR__MAX 26882 3.561846e+01 4.902349e+01 0.000000e+00 6.400000e+02 127 10513 16369
TEMP__HA_1__IR__S 26882 3.137055e+01 8.701447e+00 0.000000e+00 1.549556e+02 16112 10513 16369
TEMP__HA_2__IR__S 26882 3.127105e+01 8.688989e+00 0.000000e+00 1.550227e+02 16112 10513 16369
TEMP__HA_4__IR__S 26882 2.891099e+01 7.921209e+00 0.000000e+00 1.079578e+02 16118 10513 16369
TEMP__HA_5__IR__S 26882 3.086202e+01 8.576830e+00 0.000000e+00 1.451340e+02 16099 10513 16369
TEMP__VB__IR__S 26882 1.603233e+00 8.225923e+00 0.000000e+00 7.206003e+01 632 10513 16369
TEMP__VB_1__IR__S 26882 1.602526e+00 8.220138e+00 0.000000e+00 7.206003e+01 632 10513 16369
TEMP__VB_5__IR__S 26882 1.699921e+00 8.578367e+00 0.000000e+00 7.287423e+01 651 10513 16369
TM_FS_M 26882 1.288398e+02 1.178362e+01 9.320000e+01 1.566000e+02 5450 1718 25164
TM_FS_SSL 26882 1.277399e+02 1.056593e+01 9.333333e+01 1.597000e+02 5226 1718 25164
TM_FS_SSR 26882 1.285015e+02 1.120327e+01 9.915000e+01 1.614500e+02 5206 1718 25164
TM_LS_M 26882 1.265216e+02 1.003511e+01 8.960000e+01 1.510000e+02 4937 1718 25164
TM_LS_SSL 26882 1.318702e+02 1.032792e+01 1.014000e+02 1.628667e+02 4988 1718 25164
TM_LS_SSR 26882 1.321628e+02 1.085231e+01 9.990000e+01 1.691000e+02 5321 1718 25164
TM_SSL_FS 26882 1.453342e+02 1.446186e+01 1.041750e+02 1.791000e+02 5156 1718 25164
TM_SSL_LS 26882 1.329567e+02 1.021378e+01 1.073667e+02 1.645000e+02 4517 1718 25164
TM_SSR_FS 26882 1.376102e+02 1.083885e+01 1.031500e+02 1.767000e+02 4857 1718 25164
TM_SSR_LS 26882 1.301028e+02 8.983733e+00 1.016667e+02 1.598000e+02 4312 1718 25164
TO_FS_M 26882 1.846690e+02 1.613626e+01 1.339333e+02 2.244500e+02 6683 1718 25164
TO_FS_SSL 26882 1.885192e+02 2.793312e+01 8.000000e-01 7.977000e+02 6835 2620 24262
TO_FS_SSR 26882 1.913317e+02 1.379441e+01 1.441750e+02 2.262500e+02 6299 1718 25164
TO_LS_M 26882 1.863830e+02 1.294765e+01 1.388000e+02 2.216500e+02 5884 1718 25164
TO_LS_SSL 26882 1.961198e+02 1.250285e+01 1.511500e+02 2.289000e+02 5864 1718 25164
TO_LS_SSR 26882 1.949970e+02 1.264377e+01 1.455000e+02 2.248000e+02 5991 1718 25164
TO_SSL_FS 26882 2.052935e+02 1.576007e+01 1.476000e+02 2.410000e+02 6657 1718 25164
TO_SSL_LS 26882 2.015116e+02 1.541129e+01 1.503667e+02 2.377000e+02 6521 1718 25164
TO_SSR_FS 26882 2.030546e+02 1.587309e+01 1.448667e+02 2.412000e+02 6750 1718 25164
TO_SSR_LS 26882 1.983856e+02 1.474387e+01 1.408250e+02 2.367500e+02 6523 1718 25164
TU_FS_M 26882 1.116696e+02 7.413646e+00 8.015000e+01 1.304500e+02 4243 2298 24584
TU_FS_SSL 26882 1.087951e+02 7.716555e+00 8.368000e+01 1.342000e+02 4368 1718 25164
TU_FS_SSR 26882 1.089810e+02 7.231841e+00 8.084000e+01 1.348500e+02 4249 1718 25164
TU_LS_M 26882 1.086232e+02 6.924098e+00 6.540000e+01 1.294500e+02 4090 1959 24923
TU_LS_SSL 26882 1.152493e+02 7.128589e+00 8.065000e+01 1.382500e+02 4182 2736 24146
TU_LS_SSR 26882 1.150699e+02 7.318471e+00 8.833333e+01 1.394000e+02 4300 1959 24923
TU_SSL_FS 26882 1.257709e+02 9.163527e+00 8.836667e+01 1.524000e+02 4481 1718 25164
TU_SSL_LS 26882 1.153944e+02 7.976129e+00 9.183333e+01 1.436000e+02 4297 1718 25164
TU_SSR_FS 26882 1.221326e+02 8.008096e+00 8.710000e+01 1.467000e+02 4318 1718 25164
TU_SSR_LS 26882 1.141591e+02 7.792329e+00 8.290000e+01 1.485000e+02 4290 1718 25164
TUNDISH_POSITION 26882 1.271549e+01 1.156004e+01 0.000000e+00 4.200000e+01 32 1718 25164
V__FS_G1__IR__S 26882 3.216581e-01 1.135128e+00 0.000000e+00 8.171358e+00 1296 10513 16369
V__FS_G2__IR__S 26882 9.235807e-01 2.339188e+00 0.000000e+00 1.247841e+01 2356 10513 16369
V__FS_G3__IR__S 26882 2.603245e+00 4.623162e+00 0.000000e+00 2.758184e+01 4205 10513 16369
V__FS_G4__IR__S 26882 6.294015e+00 7.883289e+00 0.000000e+00 4.301559e+01 6762 10513 16369
V__FS_G5__IR__S 26882 1.486168e+01 1.202346e+01 0.000000e+00 7.304973e+01 10559 10513 16369
V__FS_G6__IR__S 26882 2.531957e+01 1.339378e+01 0.000000e+00 9.872687e+01 13621 10513 16369
V__FS_G7__IR__S 26882 3.673649e+01 1.082059e+01 0.000000e+00 1.611927e+02 16100 10513 16369
VERTEILERFUELLSTAND 26882 7.882090e+01 1.623281e+00 6.132000e+01 8.246667e+01 844 1718 25164
VG 26882 9.885491e-01 9.614080e-02 7.610000e-01 1.157000e+00 819 1718 25164
VORBRAMME 26882 2.670925e+02 2.481209e+02 2.200000e+01 5.530000e+02 22 0 26882
WASSER_FS 26882 2.661484e+00 1.759780e-02 2.563600e+00 2.720500e+00 1765 1718 25164
WASSER_LS 26882 2.669746e+00 1.256150e-02 2.595750e+00 2.719500e+00 1399 1718 25164
WASSER_SSL 26882 2.626474e-01 8.790700e-03 2.450000e-01 2.782000e-01 352 1718 25164
WASSER_SSR 26882 2.572780e-01 1.135910e-02 2.370000e-01 2.810000e-01 535 1718 25164
WK__FS_G1__IR__S 26882 4.442741e+01 1.564459e+02 0.000000e+00 1.077098e+03 1296 10513 16369
WK__FS_G2__IR__S 26882 8.591459e+01 2.168795e+02 0.000000e+00 1.202033e+03 2356 10513 16369
WK__FS_G3__IR__S 26882 1.582005e+02 2.797363e+02 0.000000e+00 1.679224e+03 4205 10513 16369
WK__FS_G4__IR__S 26882 2.469891e+02 3.074766e+02 0.000000e+00 1.692017e+03 6762 10513 16369
WK__FS_G5__IR__S 26882 3.512747e+02 2.814103e+02 0.000000e+00 1.725439e+03 10559 10513 16369
WK__FS_G6__IR__S 26882 3.571932e+02 1.870432e+02 0.000000e+00 1.296698e+03 13621 10513 16369
WK__FS_G7__IR__S 26882 3.018373e+02 8.826540e+01 0.000000e+00 1.437853e+03 16169 10513 16369
WK__VS_HG_3__IR__S 26882 2.081314e+01 1.078005e+02 0.000000e+00 9.421366e+02 621 10513 16369
WK__VS_SP_3__IR__S 26882 6.795180e-02 5.596544e-01 -3.138046e-01 1.225219e+01 396 10513 16369
WSPALT__FS_G1__IR__S 26882 7.454170e-02 2.619365e-01 0.000000e+00 1.830767e+00 1296 10513 16369
WSPALT__FS_G2__IR__S 26882 9.341970e-02 2.353754e-01 0.000000e+00 1.252877e+00 2356 10513 16369
WSPALT__FS_G3__IR__S 26882 1.157892e-01 2.043523e-01 0.000000e+00 1.208468e+00 4211 10513 16369
WSPALT__FS_G4__IR__S 26882 1.295598e-01 1.611171e-01 0.000000e+00 8.493480e-01 6780 10513 16369
WSPALT__FS_G5__IR__S 26882 1.390093e-01 1.109823e-01 0.000000e+00 6.135501e-01 10604 10513 16369
WSPALT__FS_G6__IR__S 26882 1.355501e-01 7.061980e-02 0.000000e+00 5.206122e-01 13687 10513 16369
WSPALT__FS_G7__IR__S 26882 1.647887e-01 4.480130e-02 0.000000e+00 6.212929e-01 16169 10513 16369

Since these descriptive statistics were last computed, 9 variables have been dropped. Even so, the data set still contains multiple variables with high pairwise correlations. The pairwise correlations are therefore re-computed on the reduced data set, and those pairs with absolute correlation greater than or equal to .95 are selected. A correlation of at least .95 in absolute value means that each variable in the pair linearly explains at least about 90% (.95² ≈ .90) of the variability in the other. Since such pairs are so highly correlated, again only one variable from each pair will be kept in the data set for further analysis.
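
The selection rule described above can be sketched on simulated data. This is only an illustration in base R, not the code used in this analysis; the data frame toy and its columns a, b, and c are hypothetical.

```r
# Toy data (hypothetical): b is nearly a copy of a, c is unrelated noise
set.seed(1)
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))
toy$b[c(3, 17)] <- NA  # a few missing values, as in the production data

# Pearson correlations, using all complete pairs of observations
cm <- cor(toy, use = "pairwise.complete.obs", method = "pearson")

# Keep each pair once (lower triangle) and apply the .95 threshold
idx <- which(abs(cm) >= 0.95 & lower.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx],
  r2   = cm[idx]^2   # share of variance explained by a linear fit
)
print(high_pairs)
```

The r2 column makes the point about explained variability explicit: a pair that passes the .95 threshold has a squared correlation of at least about .90.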

2.3.4 Correlation Matrix of Numeric Variables - Updated

## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
## Don't know how to automatically pick scale for object of type noquote. Defaulting to continuous.

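# Return all variable pairs whose absolute correlation exceeds eps,
# together with the list of row variables that are candidates for removal.
filter.cor <- function(x, eps) {
    if (is.matrix(x)) {
      # (row, col) positions of all entries with |correlation| > eps
      cor.df <- data.frame(which(abs(x) > eps, arr.ind=TRUE))
      setDT(cor.df, keep.rownames = TRUE)[]
      cor.df$cor <- x[which(abs(x) > eps, arr.ind=TRUE)]
      # keep the lower triangle only: each pair appears once, diagonal dropped
      cor.df <- cor.df[which(cor.df$row > cor.df$col, arr.ind=TRUE)]
      # name of the column variable in each pair
      cor.df$cn <- colnames(x)[cor.df$col]
      cor.list <- cor.df$rn
      # row names containing a "." are treated as repeats of an earlier name;
      # note that this also excludes genuine names such as POSITION_X.y
      grx <- glob2rx("*.*")
      duplicate.list <- grepl(grx,cor.list, perl=TRUE)
      cor.list <- cor.list[!duplicate.list]
      # strip everything after the last "." from the row names
      cor.df$rn <- sub(pattern = "(.*)\\..*$", replacement = "\\1", cor.df$rn)
      corList <- list(CorMat = cor.df, cor.list = cor.list)
      return(corList)
    } else {
      print("no matrix!")
    }
}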


df_numList <- filter.cor(cormat, eps=0.95)
df_num2 <- df_numList$CorMat
cor.list <- df_numList$cor.list
cor.list <- cor.list[!cor.list %in% c("CoilID")]
cor.list <- c(cor.list, "POSITION_X.y")
length(cor.list)
## [1] 29
#List of all variables with abs. corr >= .95
write.table(cor.list, file="anna_length_ListofVariableswithCorLT095.txt", sep="\t")

The list of variables with pairwise absolute correlation greater than or equal to .95 contains 29 variables. Before dropping such a large number of variables from the data set, it is necessary to review the summary statistics for any interesting patterns.

2.3.5 Computing Descriptives of Variables with Absolute Correlation >= .95

Variables w/ abs corr >= .95
rn count runique unique cor_var na N
ANST__VS_SP_3__IR__S 1 1 1 ANST__VS_HG_3__IR__S 0 1
CoilID 1 1 1 MAT_IDENT 0 1
ENTZ__ZW2_AL__IR__S 2 1 2 ANST__VS_HG_3__IR__S, ANST__VS_SP_3__IR__S 0 2
ENTZ__ZWR1_EL_SN1__IR__S 1 1 1 ENTZ__ZWR1_AL_SN2__IR__S 0 1
KEIL50__FB__IR__S 1 1 1 KEIL25__FB__IR__S 0 1
KONI_RECHTS 1 1 1 KONI_LINKS 0 1
POSITION_X 2 1 1 lTileID, lTileID 0 2
POSITION_X.y 2 1 2 POSITION_X.x, RIEGELLAENGE 0 2
RIEGELLAENGE 2 1 2 lTileID, POSITION_X.x 0 2
STRANGNUMMER 1 1 1 VORBRAMME 0 1
TEMP__FB__IR__S 2 1 2 TEMP__FB_1__IR__S, TEMP__FB_2__IR__S 0 2
TEMP__FB_2__IR__S 1 1 1 TEMP__FB_1__IR__S 0 1
TEMP__HA__IR__S 2 1 2 TEMP__HA_1__IR__S, TEMP__HA_2__IR__S 0 2
TEMP__HA_2__IR__S 1 1 1 TEMP__HA_1__IR__S 0 1
TEMP__VB__IR__S 5 1 5 ANST__VS_HG_3__IR__S, ANST__VS_SP_3__IR__S, ENTZ__ZW2_AL__IR__S, TEMP__VB_1__IR__S, TEMP__VB_5__IR__S 0 5
TEMP__VB_1__IR__S 3 1 3 ANST__VS_HG_3__IR__S, ANST__VS_SP_3__IR__S, ENTZ__ZW2_AL__IR__S 0 3
TEMP__VB_5__IR__S 2 1 2 ENTZ__ZW2_AL__IR__S, TEMP__VB_1__IR__S 0 2
V__FS_G1__IR__S 1 1 1 ENTZ__FS_ZW_F1__IR__S 0 1
WK__FS_G1__IR__S 2 1 2 ENTZ__FS_ZW_F1__IR__S, V__FS_G1__IR__S 0 2
WK__FS_G2__IR__S 1 1 1 V__FS_G2__IR__S 0 1
WK__FS_G3__IR__S 1 1 1 V__FS_G3__IR__S 0 1
WK__FS_G4__IR__S 1 1 1 V__FS_G4__IR__S 0 1
WK__FS_G5__IR__S 1 1 1 V__FS_G5__IR__S 0 1
WK__FS_G6__IR__S 1 1 1 V__FS_G6__IR__S 0 1
WK__VS_HG_3__IR__S 5 1 5 ANST__VS_HG_3__IR__S, ANST__VS_SP_3__IR__S, ENTZ__ZW2_AL__IR__S, TEMP__VB_1__IR__S, TEMP__VB__IR__S 0 5
WSPALT__FS_G1__IR__S 3 1 3 ENTZ__FS_ZW_F1__IR__S, V__FS_G1__IR__S, WK__FS_G1__IR__S 0 3
WSPALT__FS_G2__IR__S 2 1 2 V__FS_G2__IR__S, WK__FS_G2__IR__S 0 2
WSPALT__FS_G3__IR__S 2 1 2 V__FS_G3__IR__S, WK__FS_G3__IR__S 0 2
WSPALT__FS_G4__IR__S 2 1 2 V__FS_G4__IR__S, WK__FS_G4__IR__S 0 2
WSPALT__FS_G5__IR__S 2 1 2 V__FS_G5__IR__S, WK__FS_G5__IR__S 0 2
WSPALT__FS_G6__IR__S 1 1 1 V__FS_G6__IR__S 0 1

As the table above shows nothing immediately worrying, all variables listed in cor.list are dropped from the data set. CoilID appears in the table but was removed from cor.list above, as it is an identifier, and is therefore retained.

#dropping corresponding columns
df_num <- df_num %>%
  dplyr::select(-dplyr::all_of(cor.list))

2.3.6 Correlation Matrix of Numeric Variables - Final

## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
## Don't know how to automatically pick scale for object of type noquote. Defaulting to continuous.


The differences between the original correlation matrix and the matrix built on the reduced data set can now be seen clearly. The remaining variables, although still highly correlated in places, are less extremely correlated overall. When comparing the above correlation matrix with previous matrices, note that each individual matrix is ordered so that highly correlated variables are listed together; the order of the variables therefore changes from matrix to matrix as variables are dropped.

2.3.7 Computing Descriptives of Numeric Variables - Updated

Having dropped a notable number of variables, the descriptive statistics are computed one final time for the numeric variables and are reviewed for interesting patterns.

Updated Numerical Variable Descriptives
slab_attr count mean sd min max unique na N
ANST__VS_HG_3__IR__S 26882 4.564310e-02 2.360949e-01 0.000000e+00 2.048611e+00 621 10513 16369
ARGON_DRUCK_ST 26882 5.986625e+01 1.870098e+01 1.800000e+01 1.000000e+02 426 1718 25164
ARGON_DURCHFL_DUSCH 26882 1.152141e+02 3.499762e+01 5.200000e-01 1.900000e+02 4033 1718 25164
ARGON_DURCHFL_ST 26882 8.326040e+00 5.982072e-01 3.410000e+00 1.108000e+01 1050 1718 25164
CHARGEN_NR 26882 4.490450e+05 2.574589e+05 1.631710e+05 7.226910e+05 249 0 26882
Class.14 26882 2.823450e-02 4.637674e-01 0.000000e+00 3.700000e+01 17 0 26882
Class.15 26882 2.652330e-02 4.595173e-01 0.000000e+00 2.800000e+01 20 0 26882
Class.4 26882 4.746671e-01 1.265641e+00 0.000000e+00 3.500000e+01 27 0 26882
CoilID 26882 1.914100e+07 3.678003e+05 1.865870e+07 2.002710e+07 657 0 26882
DICKE__AL__IR__S 26882 1.511021e-01 4.236360e-02 0.000000e+00 6.130659e-01 16028 10513 16369
DICKE__HA_1__IR__S 26882 1.516607e-01 4.182080e-02 0.000000e+00 6.130659e-01 16086 10513 16369
DICKE__HA_2__IR__S 26882 1.459174e-01 4.722860e-02 0.000000e+00 4.192426e-01 15569 10513 16369
DICKE__VB__IR__S 26882 4.809820e-02 2.543457e-01 0.000000e+00 2.269721e+00 596 10513 16369
DT_FS 26882 6.856955e+01 4.827426e+00 5.070000e+01 8.510000e+01 2260 1718 25164
DT_LS 26882 7.040780e+01 5.088746e+00 5.050000e+01 8.580000e+01 2415 1718 25164
DT_SSL 26882 5.877077e+01 4.419236e+00 4.358000e+01 6.970000e+01 2175 1718 25164
DT_SSR 26882 5.354157e+01 3.564115e+00 4.070000e+01 6.600000e+01 2055 1718 25164
ENTZ__FS_ZW_F1__IR__S 26882 3.392100e-03 1.191530e-02 0.000000e+00 8.333330e-02 89 10513 16369
ENTZ__FS_ZW_F2__IR__S 26882 5.956500e-03 1.562020e-02 0.000000e+00 8.333330e-02 96 10513 16369
ENTZ__ZW_OF_AL__IR__S 26882 3.000000e-06 2.196000e-04 0.000000e+00 1.639340e-02 4 10513 16369
ENTZ__ZWR1_AL_SN2__IR__S 26882 1.111000e-04 2.261600e-03 0.000000e+00 6.666670e-02 23 10513 16369
FUELLSTAND 26882 7.497974e+01 6.737898e-01 7.200000e+01 7.850000e+01 97 1718 25164
KEIL25__FB__IR__S 26882 -2.246960e-02 8.339448e-01 -4.823868e+00 3.480015e+00 15190 10513 16369
KEIL40__FB__IR__S 26882 -1.268257e-01 7.662683e-01 -4.087126e+00 3.306016e+00 15755 10513 16369
KONI_LINKS 26882 1.119920e+01 6.437812e-01 3.700000e+00 1.867500e+01 204 1718 25164
Length.max.slab 26882 1.265407e+03 5.041442e+01 5.280000e+02 1.312000e+03 30 4034 22848
lTileID 26882 2.433464e+02 1.369660e+02 1.000000e+00 5.100000e+02 511 7 26875
MAT_IDENT 26882 2.985188e+07 3.937609e+05 2.917029e+07 3.077100e+07 657 0 26882
NETTO_PFANNENINHALT 26882 1.648394e+02 7.692643e+01 0.000000e+00 4.040000e+02 12751 1718 25164
PLATTENDICKE_SSL 26882 4.829999e+01 2.019073e+00 4.320000e+01 5.000000e+01 40 1718 25164
PLATTENDICKE_SSR 26882 4.745850e+01 2.089287e+00 4.329000e+01 5.000000e+01 35 1718 25164
POSITION_X.x 26882 5.979768e+02 3.447785e+02 5.175509e+00 1.841143e+03 25133 1718 25164
PR_40__FB__IR__S 26882 1.800993e+00 7.283150e-01 0.000000e+00 8.111780e+00 15842 10513 16369
RISS__HA_AS__IR__S 26882 3.270000e-05 3.355500e-03 0.000000e+00 4.107143e-01 4 10513 16369
RISS__HA_BS__IR__S 26882 1.509000e-04 8.996600e-03 0.000000e+00 7.083333e-01 9 10513 16369
STOPFENSTELLUNG 26882 5.489166e+01 5.534024e+00 4.300000e+01 7.000000e+01 262 1718 25164
STRANGBREITE 26882 2.490284e+03 1.190497e+02 2.151000e+03 2.577000e+03 118 1718 25164
TEMP__FB_1__IR__S 26882 4.488583e+01 1.228837e+01 0.000000e+00 1.951373e+02 16109 10513 16369
TEMP__FB_3__IR__S 26882 4.485969e+01 1.367100e+01 0.000000e+00 4.520970e+02 16111 10513 16369
TEMP__HA__SR__MAX 26882 3.561846e+01 4.902349e+01 0.000000e+00 6.400000e+02 127 10513 16369
TEMP__HA_1__IR__S 26882 3.137055e+01 8.701447e+00 0.000000e+00 1.549556e+02 16112 10513 16369
TEMP__HA_4__IR__S 26882 2.891099e+01 7.921209e+00 0.000000e+00 1.079578e+02 16118 10513 16369
TEMP__HA_5__IR__S 26882 3.086202e+01 8.576830e+00 0.000000e+00 1.451340e+02 16099 10513 16369
TM_FS_M 26882 1.288398e+02 1.178362e+01 9.320000e+01 1.566000e+02 5450 1718 25164
TM_FS_SSL 26882 1.277399e+02 1.056593e+01 9.333333e+01 1.597000e+02 5226 1718 25164
TM_FS_SSR 26882 1.285015e+02 1.120327e+01 9.915000e+01 1.614500e+02 5206 1718 25164
TM_LS_M 26882 1.265216e+02 1.003511e+01 8.960000e+01 1.510000e+02 4937 1718 25164
TM_LS_SSL 26882 1.318702e+02 1.032792e+01 1.014000e+02 1.628667e+02 4988 1718 25164
TM_LS_SSR 26882 1.321628e+02 1.085231e+01 9.990000e+01 1.691000e+02 5321 1718 25164
TM_SSL_FS 26882 1.453342e+02 1.446186e+01 1.041750e+02 1.791000e+02 5156 1718 25164
TM_SSL_LS 26882 1.329567e+02 1.021378e+01 1.073667e+02 1.645000e+02 4517 1718 25164
TM_SSR_FS 26882 1.376102e+02 1.083885e+01 1.031500e+02 1.767000e+02 4857 1718 25164
TM_SSR_LS 26882 1.301028e+02 8.983733e+00 1.016667e+02 1.598000e+02 4312 1718 25164
TO_FS_M 26882 1.846690e+02 1.613626e+01 1.339333e+02 2.244500e+02 6683 1718 25164
TO_FS_SSL 26882 1.885192e+02 2.793312e+01 8.000000e-01 7.977000e+02 6835 2620 24262
TO_FS_SSR 26882 1.913317e+02 1.379441e+01 1.441750e+02 2.262500e+02 6299 1718 25164
TO_LS_M 26882 1.863830e+02 1.294765e+01 1.388000e+02 2.216500e+02 5884 1718 25164
TO_LS_SSL 26882 1.961198e+02 1.250285e+01 1.511500e+02 2.289000e+02 5864 1718 25164
TO_LS_SSR 26882 1.949970e+02 1.264377e+01 1.455000e+02 2.248000e+02 5991 1718 25164
TO_SSL_FS 26882 2.052935e+02 1.576007e+01 1.476000e+02 2.410000e+02 6657 1718 25164
TO_SSL_LS 26882 2.015116e+02 1.541129e+01 1.503667e+02 2.377000e+02 6521 1718 25164
TO_SSR_FS 26882 2.030546e+02 1.587309e+01 1.448667e+02 2.412000e+02 6750 1718 25164
TO_SSR_LS 26882 1.983856e+02 1.474387e+01 1.408250e+02 2.367500e+02 6523 1718 25164
TU_FS_M 26882 1.116696e+02 7.413646e+00 8.015000e+01 1.304500e+02 4243 2298 24584
TU_FS_SSL 26882 1.087951e+02 7.716555e+00 8.368000e+01 1.342000e+02 4368 1718 25164
TU_FS_SSR 26882 1.089810e+02 7.231841e+00 8.084000e+01 1.348500e+02 4249 1718 25164
TU_LS_M 26882 1.086232e+02 6.924098e+00 6.540000e+01 1.294500e+02 4090 1959 24923
TU_LS_SSL 26882 1.152493e+02 7.128589e+00 8.065000e+01 1.382500e+02 4182 2736 24146
TU_LS_SSR 26882 1.150699e+02 7.318471e+00 8.833333e+01 1.394000e+02 4300 1959 24923
TU_SSL_FS 26882 1.257709e+02 9.163527e+00 8.836667e+01 1.524000e+02 4481 1718 25164
TU_SSL_LS 26882 1.153944e+02 7.976129e+00 9.183333e+01 1.436000e+02 4297 1718 25164
TU_SSR_FS 26882 1.221326e+02 8.008096e+00 8.710000e+01 1.467000e+02 4318 1718 25164
TU_SSR_LS 26882 1.141591e+02 7.792329e+00 8.290000e+01 1.485000e+02 4290 1718 25164
TUNDISH_POSITION 26882 1.271549e+01 1.156004e+01 0.000000e+00 4.200000e+01 32 1718 25164
V__FS_G2__IR__S 26882 9.235807e-01 2.339188e+00 0.000000e+00 1.247841e+01 2356 10513 16369
V__FS_G3__IR__S 26882 2.603245e+00 4.623162e+00 0.000000e+00 2.758184e+01 4205 10513 16369
V__FS_G4__IR__S 26882 6.294015e+00 7.883289e+00 0.000000e+00 4.301559e+01 6762 10513 16369
V__FS_G5__IR__S 26882 1.486168e+01 1.202346e+01 0.000000e+00 7.304973e+01 10559 10513 16369
V__FS_G6__IR__S 26882 2.531957e+01 1.339378e+01 0.000000e+00 9.872687e+01 13621 10513 16369
V__FS_G7__IR__S 26882 3.673649e+01 1.082059e+01 0.000000e+00 1.611927e+02 16100 10513 16369
VERTEILERFUELLSTAND 26882 7.882090e+01 1.623281e+00 6.132000e+01 8.246667e+01 844 1718 25164
VG 26882 9.885491e-01 9.614080e-02 7.610000e-01 1.157000e+00 819 1718 25164
VORBRAMME 26882 2.670925e+02 2.481209e+02 2.200000e+01 5.530000e+02 22 0 26882
WASSER_FS 26882 2.661484e+00 1.759780e-02 2.563600e+00 2.720500e+00 1765 1718 25164
WASSER_LS 26882 2.669746e+00 1.256150e-02 2.595750e+00 2.719500e+00 1399 1718 25164
WASSER_SSL 26882 2.626474e-01 8.790700e-03 2.450000e-01 2.782000e-01 352 1718 25164
WASSER_SSR 26882 2.572780e-01 1.135910e-02 2.370000e-01 2.810000e-01 535 1718 25164
WK__FS_G7__IR__S 26882 3.018373e+02 8.826540e+01 0.000000e+00 1.437853e+03 16169 10513 16369
WK__VS_SP_3__IR__S 26882 6.795180e-02 5.596544e-01 -3.138046e-01 1.225219e+01 396 10513 16369
WSPALT__FS_G7__IR__S 26882 1.647887e-01 4.480130e-02 0.000000e+00 6.212929e-01 16169 10513 16369

The working data set has now been reduced to 90 variables, all non-constant. Having reduced the numeric variables as much as seems required, these changes may be applied to the main data frame, df.

2.4 Applying Changes to Main Data Frame

#Reduce accordingly
df <- df %>%
  dplyr::select(-dplyr::all_of(cor1.list), -dplyr::all_of(cor.list), -dplyr::all_of(var.sd0)) %>%
  dplyr::rename(
    POSITION_X = POSITION_X.x
  )

#Dropping redundancies and piece related vars
df <- df %>%
  dplyr::select(-CHARGEN_NR, -VORBRAMME, -Length.max.slab) 

Our main data frame now has 88 variables and 26882 observations. With no additional motivation to reduce the data set further, the error rates can now be modeled, as a function of all remaining production variables, in order to determine which variables seem to have the most significant impact on the correct classification of a surface error.


2.5 Modeling the Error Rates

2.5.1 Log Error Rates

The distributions of the error rates are highly skewed: among the large number of observations, a relatively large share are zero. A log transform is therefore first applied to (data + 1), which reduces the skew while avoiding the undefined value log(0).

#Log transform of error counts
df1 <- df%>%
  dplyr::mutate(lnClass.4 = log(Class.4 + 1), 
                lnClass.14 = log(Class.14 + 1), 
                lnClass.15 = log(Class.15 + 1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Above, the log transformation is applied to the data, and the histograms show the spread of each error class before and after the transform. The transform has its greatest effect on the Class 4 errors and causes no notable change in the spread of the other two error classes. As such, the log transformation will only be used for the Class 4 errors.
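
As a small aside, the log(count + 1) transform used above can also be written with the base R function log1p, and undone with expm1. The sketch below uses a made-up vector of counts (the name counts and its values are hypothetical) to show that both forms agree and that zero counts map to zero.

```r
# Hypothetical error counts, including zeros
counts <- c(0, 0, 1, 3, 35)

ln_manual <- log(counts + 1)  # form used in this analysis
ln_log1p  <- log1p(counts)    # equivalent base R helper

# Back-transform to the original count scale
back <- expm1(ln_log1p)

print(cbind(counts, ln_manual, ln_log1p, back))
```

Either form can be used inside dplyr::mutate; log1p is mainly a convenience and is more accurate for values very close to zero.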

2.5.2 Class 4 Errors

2.5.2.1 Linear Model for Log(Class 4) Errors

linmodlnC4.pred <- lm(lnClass.4~.- CoilID - MAT_IDENT - lTileID - Class.4 - Class.14 - Class.15 - lnClass.14 - lnClass.15, data =df1)
## 
## Call:
## lm(formula = lnClass.4 ~ . - CoilID - MAT_IDENT - lTileID - Class.4 - 
##     Class.14 - Class.15 - lnClass.14 - lnClass.15, data = df1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81393 -0.25878 -0.20246 -0.09486  3.02660 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.045e+00  1.386e+00   2.197 0.028004 *  
## POSITION_X               -2.254e-05  4.187e-05  -0.538 0.590412    
## VORG_HAUPTAGGREGATBRSG01 -1.344e-02  1.092e-02  -1.231 0.218241    
## TO_FS_SSL                 9.799e-05  1.542e-04   0.635 0.525236    
## TO_FS_M                  -1.340e-03  4.123e-04  -3.250 0.001158 ** 
## TO_FS_SSR                -7.721e-04  6.162e-04  -1.253 0.210224    
## TO_SSR_FS                 1.420e-03  8.712e-04   1.630 0.103048    
## TO_SSR_LS                -2.236e-03  8.180e-04  -2.734 0.006266 ** 
## TO_LS_SSR                 2.533e-04  6.496e-04   0.390 0.696632    
## TO_LS_M                   7.052e-04  5.013e-04   1.407 0.159508    
## TO_LS_SSL                 3.637e-04  6.891e-04   0.528 0.597618    
## TO_SSL_LS                -3.274e-04  8.249e-04  -0.397 0.691467    
## TO_SSL_FS                 4.086e-05  9.228e-04   0.044 0.964680    
## TM_FS_SSL                 5.095e-04  6.625e-04   0.769 0.441896    
## TM_FS_M                  -1.217e-03  6.516e-04  -1.867 0.061900 .  
## TM_FS_SSR                -7.560e-04  7.099e-04  -1.065 0.286942    
## TM_SSR_FS                 1.306e-03  9.854e-04   1.326 0.184911    
## TM_SSR_LS                -9.450e-04  1.156e-03  -0.817 0.413788    
## TM_LS_SSR                -1.378e-03  6.347e-04  -2.172 0.029905 *  
## TM_LS_M                  -2.125e-04  8.027e-04  -0.265 0.791170    
## TM_LS_SSL                 1.666e-04  6.990e-04   0.238 0.811601    
## TM_SSL_LS                -1.847e-03  9.021e-04  -2.047 0.040667 *  
## TM_SSL_FS                 2.446e-03  7.413e-04   3.299 0.000972 ***
## TU_FS_SSL                -7.550e-04  9.822e-04  -0.769 0.442102    
## TU_FS_M                   2.049e-03  1.153e-03   1.777 0.075521 .  
## TU_FS_SSR                -4.318e-04  1.107e-03  -0.390 0.696440    
## TU_SSR_FS                -4.822e-04  1.124e-03  -0.429 0.667897    
## TU_SSR_LS                -7.924e-04  1.195e-03  -0.663 0.507194    
## TU_LS_SSR                 2.063e-03  1.010e-03   2.042 0.041156 *  
## TU_LS_M                   3.187e-03  1.264e-03   2.521 0.011723 *  
## TU_LS_SSL                -1.052e-03  1.016e-03  -1.035 0.300674    
## TU_SSL_LS                 9.481e-04  1.117e-03   0.849 0.396103    
## TU_SSL_FS                -1.399e-03  1.112e-03  -1.258 0.208411    
## DT_SSR                   -3.572e-03  3.648e-03  -0.979 0.327524    
## DT_SSL                    2.018e-03  4.089e-03   0.494 0.621555    
## DT_FS                     8.808e-03  3.890e-03   2.264 0.023568 *  
## DT_LS                    -1.076e-02  4.267e-03  -2.522 0.011688 *  
## VG                        4.026e-01  1.336e-01   3.014 0.002582 ** 
## FUELLSTAND                5.246e-03  5.824e-03   0.901 0.367686    
## STRANGBREITE             -1.950e-05  1.092e-04  -0.179 0.858271    
## WASSER_SSR                1.582e+00  1.452e+00   1.089 0.275961    
## WASSER_SSL               -3.724e+00  1.763e+00  -2.112 0.034668 *  
## WASSER_FS                 1.568e-02  2.325e-01   0.067 0.946215    
## WASSER_LS                -6.082e-01  3.444e-01  -1.766 0.077401 .  
## STOPFENSTELLUNG          -2.706e-03  9.876e-04  -2.740 0.006157 ** 
## PLATTENDICKE_SSL         -1.105e-02  5.876e-03  -1.881 0.059994 .  
## PLATTENDICKE_SSR         -5.142e-03  6.111e-03  -0.841 0.400124    
## ARGON_DRUCK_ST           -3.654e-04  2.768e-04  -1.320 0.186863    
## ARGON_DURCHFL_ST          7.070e-04  7.740e-03   0.091 0.927226    
## ARGON_DURCHFL_DUSCH      -8.056e-04  2.050e-04  -3.929 8.56e-05 ***
## TUNDISH_POSITION          3.162e-04  4.550e-04   0.695 0.487003    
## VERTEILERFUELLSTAND       2.349e-03  3.525e-03   0.666 0.505260    
## NETTO_PFANNENINHALT      -8.501e-05  5.878e-05  -1.446 0.148108    
## KONI_LINKS               -9.322e-03  1.058e-02  -0.881 0.378509    
## ANST__VS_HG_3__IR__S      3.867e-02  4.738e-02   0.816 0.414367    
## DICKE__AL__IR__S         -2.305e-01  1.660e-01  -1.389 0.164917    
## DICKE__HA_1__IR__S       -2.860e-02  1.179e-01  -0.243 0.808334    
## DICKE__HA_2__IR__S        1.286e-01  1.423e-01   0.903 0.366483    
## DICKE__VB__IR__S         -2.147e-02  4.097e-02  -0.524 0.600294    
## ENTZ__FS_ZW_F1__IR__S     1.300e+00  5.197e-01   2.501 0.012409 *  
## ENTZ__FS_ZW_F2__IR__S     6.093e-01  6.017e-01   1.013 0.311270    
## ENTZ__ZWR1_AL_SN2__IR__S  5.213e+00  1.947e+00   2.678 0.007423 ** 
## ENTZ__ZW_OF_AL__IR__S     1.176e+01  1.744e+01   0.674 0.500182    
## KEIL25__FB__IR__S        -6.419e-03  6.359e-03  -1.009 0.312826    
## KEIL40__FB__IR__S         2.183e-02  6.616e-03   3.299 0.000973 ***
## PR_40__FB__IR__S         -5.267e-03  6.040e-03  -0.872 0.383208    
## RISS__HA_AS__IR__S       -9.389e-01  1.080e+00  -0.869 0.384858    
## RISS__HA_BS__IR__S       -6.366e-02  5.195e-01  -0.123 0.902476    
## TEMP__FB_1__IR__S         4.224e-04  3.939e-04   1.073 0.283508    
## TEMP__FB_3__IR__S        -2.760e-04  3.614e-04  -0.764 0.445018    
## TEMP__HA_1__IR__S         2.599e-03  7.809e-04   3.329 0.000875 ***
## TEMP__HA_4__IR__S        -4.546e-04  8.165e-04  -0.557 0.577678    
## TEMP__HA_5__IR__S        -1.721e-03  5.944e-04  -2.895 0.003796 ** 
## TEMP__HA__SR__MAX        -1.989e-05  9.419e-05  -0.211 0.832728    
## V__FS_G2__IR__S          -2.614e-03  4.086e-03  -0.640 0.522302    
## V__FS_G3__IR__S           1.530e-03  1.317e-03   1.162 0.245249    
## V__FS_G4__IR__S          -7.725e-04  7.813e-04  -0.989 0.322858    
## V__FS_G5__IR__S           7.209e-04  4.857e-04   1.484 0.137705    
## V__FS_G6__IR__S           6.037e-04  3.552e-04   1.699 0.089282 .  
## V__FS_G7__IR__S          -6.164e-04  1.133e-03  -0.544 0.586501    
## WK__FS_G7__IR__S          1.318e-04  1.231e-04   1.071 0.284151    
## WK__VS_SP_3__IR__S        1.660e-03  1.032e-02   0.161 0.872271    
## WSPALT__FS_G7__IR__S     -1.281e-01  2.922e-01  -0.438 0.661134    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4625 on 13963 degrees of freedom
##   (12836 observations deleted due to missingness)
## Multiple R-squared:  0.02408,    Adjusted R-squared:  0.01835 
## F-statistic: 4.202 on 82 and 13963 DF,  p-value: < 2.2e-16
ANOVA Results of Linear Model for Log Error count of Class 4 Errors
Df Sum Sq Mean Sq F value Pr(>F)
POSITION_X 1 12.1052258 12.1052258 56.6013309 0.0000000
VORG_HAUPTAGGREGAT 1 0.3064707 0.3064707 1.4329886 0.2312976
TO_FS_SSL 1 0.3044748 0.3044748 1.4236563 0.2328228
TO_FS_M 1 0.9983175 0.9983175 4.6679096 0.0307483
TO_FS_SSR 1 0.2452430 0.2452430 1.1467016 0.2842592
TO_SSR_FS 1 0.4734872 0.4734872 2.2139202 0.1367933
TO_SSR_LS 1 2.2104284 2.2104284 10.3354694 0.0013079
TO_LS_SSR 1 0.1913628 0.1913628 0.8947698 0.3442044
TO_LS_M 1 1.2948613 1.2948613 6.0544822 0.0138831
TO_LS_SSL 1 0.4928747 0.4928747 2.3045718 0.1290165
TO_SSL_LS 1 0.4538694 0.4538694 2.1221920 0.1452011
TO_SSL_FS 1 2.7332745 2.7332745 12.7801808 0.0003515
TM_FS_SSL 1 0.0882784 0.0882784 0.4127701 0.5205774
TM_FS_M 1 1.3246962 1.3246962 6.1939832 0.0128301
TM_FS_SSR 1 0.9366880 0.9366880 4.3797436 0.0363865
TM_SSR_FS 1 1.1467355 1.1467355 5.3618791 0.0205960
TM_SSR_LS 1 0.5801224 0.5801224 2.7125226 0.0995859
TM_LS_SSR 1 1.7378946 1.7378946 8.1260067 0.0043699
TM_LS_M 1 0.7700484 0.7700484 3.6005745 0.0577802
TM_LS_SSL 1 0.3290349 0.3290349 1.5384935 0.2148630
TM_SSL_LS 1 0.7101211 0.7101211 3.3203677 0.0684479
TM_SSL_FS 1 4.0556880 4.0556880 18.9634910 0.0000134
TU_FS_SSL 1 0.3663775 0.3663775 1.7130991 0.1906053
TU_FS_M 1 2.5541811 2.5541811 11.9427800 0.0005502
TU_FS_SSR 1 0.1974841 0.1974841 0.9233916 0.3366027
TU_SSR_FS 1 1.3126413 1.3126413 6.1376172 0.0132453
TU_SSR_LS 1 0.2708226 0.2708226 1.2663060 0.2604802
TU_LS_SSR 1 1.1134347 1.1134347 5.2061720 0.0225218
TU_LS_M 1 0.2590926 0.2590926 1.2114590 0.2710623
TU_LS_SSL 1 0.3614753 0.3614753 1.6901775 0.1935990
TU_SSL_LS 1 0.0086773 0.0086773 0.0405731 0.8403671
TU_SSL_FS 1 1.1712469 1.1712469 5.4764889 0.0192879
DT_SSR 1 0.8342697 0.8342697 3.9008587 0.0482811
DT_SSL 1 0.3408300 0.3408300 1.5936448 0.2068273
DT_FS 1 0.2942908 0.2942908 1.3760380 0.2407972
DT_LS 1 1.5307590 1.5307590 7.1574872 0.0074739
VG 1 3.8163329 3.8163329 17.8443201 0.0000241
FUELLSTAND 1 0.2139371 0.2139371 1.0003221 0.3172499
STRANGBREITE 1 0.1871895 0.1871895 0.8752564 0.3495204
WASSER_SSR 1 0.3589889 0.3589889 1.6785519 0.1951384
WASSER_SSL 1 0.4405181 0.4405181 2.0597643 0.1512560
WASSER_FS 1 0.0093231 0.0093231 0.0435930 0.8346158
WASSER_LS 1 0.8040779 0.8040779 3.7596887 0.0525225
STOPFENSTELLUNG 1 1.3107257 1.3107257 6.1286604 0.0133125
PLATTENDICKE_SSL 1 0.8622883 0.8622883 4.0318674 0.0446677
PLATTENDICKE_SSR 1 0.3175894 0.3175894 1.4849773 0.2230180
ARGON_DRUCK_ST 1 0.2258808 0.2258808 1.0561681 0.3041086
ARGON_DURCHFL_ST 1 0.0140627 0.0140627 0.0657542 0.7976259
ARGON_DURCHFL_DUSCH 1 3.0933738 3.0933738 14.4639247 0.0001435
TUNDISH_POSITION 1 0.0427591 0.0427591 0.1999318 0.6547829
VERTEILERFUELLSTAND 1 0.1813750 0.1813750 0.8480689 0.3571151
NETTO_PFANNENINHALT 1 0.3953274 0.3953274 1.8484625 0.1739843
KONI_LINKS 1 0.1736209 0.1736209 0.8118125 0.3676005
ANST__VS_HG_3__IR__S 1 3.5701146 3.5701146 16.6930581 0.0000442
DICKE__AL__IR__S 1 0.3173051 0.3173051 1.4836477 0.2232253
DICKE__HA_1__IR__S 1 0.0055638 0.0055638 0.0260150 0.8718660
DICKE__HA_2__IR__S 1 0.1528517 0.1528517 0.7147003 0.3979023
DICKE__VB__IR__S 1 0.1091491 0.1091491 0.5103566 0.4749965
ENTZ__FS_ZW_F1__IR__S 1 1.5400281 1.5400281 7.2008271 0.0072956
ENTZ__FS_ZW_F2__IR__S 1 0.0453203 0.0453203 0.2119077 0.6452834
ENTZ__ZWR1_AL_SN2__IR__S 1 2.1054425 2.1054425 9.8445785 0.0017069
ENTZ__ZW_OF_AL__IR__S 1 0.0949082 0.0949082 0.4437697 0.5053196
KEIL25__FB__IR__S 1 0.0723374 0.0723374 0.3382336 0.5608600
KEIL40__FB__IR__S 1 2.2036277 2.2036277 10.3036705 0.0013307
PR_40__FB__IR__S 1 0.0819987 0.0819987 0.3834077 0.5357952
RISS__HA_AS__IR__S 1 0.1979669 0.1979669 0.9256488 0.3360128
RISS__HA_BS__IR__S 1 0.0003147 0.0003147 0.0014716 0.9693997
TEMP__FB_1__IR__S 1 0.4173838 0.4173838 1.9515933 0.1624375
TEMP__FB_3__IR__S 1 0.0038659 0.0038659 0.0180761 0.8930506
TEMP__HA_1__IR__S 1 2.3284451 2.3284451 10.8872891 0.0009707
TEMP__HA_4__IR__S 1 0.0025675 0.0025675 0.0120051 0.9127539
TEMP__HA_5__IR__S 1 1.7216025 1.7216025 8.0498286 0.0045574
TEMP__HA__SR__MAX 1 0.0251762 0.0251762 0.1177183 0.7315276
V__FS_G2__IR__S 1 0.0906945 0.0906945 0.4240670 0.5149253
V__FS_G3__IR__S 1 0.0295655 0.0295655 0.1382416 0.7100411
V__FS_G4__IR__S 1 0.3805337 0.3805337 1.7792904 0.1822582
V__FS_G5__IR__S 1 0.7436703 0.7436703 3.4772362 0.0622395
V__FS_G6__IR__S 1 0.6349043 0.6349043 2.9686704 0.0849137
V__FS_G7__IR__S 1 0.0083502 0.0083502 0.0390438 0.8433649
WK__FS_G7__IR__S 1 0.2152435 0.2152435 1.0064305 0.3157769
WK__VS_SP_3__IR__S 1 0.0058208 0.0058208 0.0272168 0.8689660
WSPALT__FS_G7__IR__S 1 0.0410964 0.0410964 0.1921576 0.6611337
Residuals 13963 2986.2419272 0.2138682 NA NA

The first model used in this analysis is a linear model. Its prediction equation takes into account all variables, excluding those used for identification and those recording the error counts themselves. Although the data is not expected to behave linearly, a linear model is still fitted first, both to obtain a baseline of possibly significant variables and to review the usefulness of any specific predictor equations. As the data is being reviewed for all possibly significant variables, the full set of variables is used in the predictive equation to begin with, and reduced from there if necessary.

From the summary and ANOVA information presented above, one can determine which variables are statistically significant in predicting the occurrence of log(Class 4) errors when the data is modeled using a linear model. The significant variables according to this model are:

  • Highly Significant Variables (Pval < 0.001)
    • TEMP__HA_1__IR__S
    • KEIL40__FB__IR__S
    • ARGON_DURCHFL_DUSCH
    • TM_SSL_FS
  • Significant Variables (Pval <= 0.05)
    • TO_FS_M
    • TO_SSR_LS
    • VG
    • STOPFENSTELLUNG
    • TM_LS_SSR
    • TM_SSL_LS
    • TU_LS_SSR
    • TU_LS_M
    • DT_FS
    • DT_LS
    • WASSER_SSL
    • ENTZ__FS_ZW_F1__IR__S
    • ENTZ__ZWR1_AL_SN2__IR__S
    • TEMP__HA_5__IR__S

Although all of these variables are marked as significant, the model fits the data very poorly, with an adjusted R2 value of 0.018. As the linear model explains so little of the variability in the data, other models are needed going forward for selecting significant variables.
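As a rough illustration of how the adjusted R2 value and the two significance tiers could be read off a fitted linear model, the sketch below uses the built-in mtcars data rather than the steel data. The model formula and all object names are assumptions made for the example. Note that anova() on an lm fit reports sequential tests, so the p-values depend on the order in which the terms enter the formula.

```r
# Toy linear model on a built-in data set (not the steel data).
fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Adjusted R-squared as reported by summary().
adj_r2 <- summary(fit)$adj.r.squared

# Sequential ANOVA table; the last row holds the residuals and has no p-value.
aov_tab <- anova(fit)
p_vals <- aov_tab[["Pr(>F)"]]
names(p_vals) <- rownames(aov_tab)
p_vals <- p_vals[!is.na(p_vals)]

# Split the terms into the two tiers used in the text.
highly_sig <- names(p_vals)[p_vals < 0.001]
sig        <- names(p_vals)[p_vals >= 0.001 & p_vals <= 0.05]

print(round(adj_r2, 3))
print(highly_sig)
print(sig)
```

A low adjusted R2 alongside many small p-values, as seen in this analysis, usually indicates that the sample is large enough to detect weak effects even though the model explains little of the variation.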

Tree models and random forest models will be used for all error classes to test for variable significance. For this work, conditional trees and conditional random forests have been used, by means of the ctree function in the partykit package and the cforest function in the party package. The party package provides both ctree and cforest, but its successor, partykit, has improved upon the implementation of the old ctree function. Cforest is not yet fully developed in the partykit package, so the partykit version is not used here. Conditional trees were chosen for this analysis to avoid the bias seen in rpart trees. Rpart trees tend to select node variables with the greatest potential for many splits, while conditional trees implement a selection algorithm specifically designed to avoid this bias.

2.5.2.2 Setting the Seed

set.seed(80542)

Here a random seed is set to allow for reproducible results.
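A seed fixes the random number stream only from the point at which it is set. The minimal sketch below, which uses sample() as a stand-in for any random step such as a data partition, shows that resetting the seed reproduces a draw, while continuing from the same stream gives a new one. The object names are illustrative only.

```r
# Same seed, same draw.
set.seed(80542)
first_draw <- sample(100, 10)

set.seed(80542)
second_draw <- sample(100, 10)

# Drawing again without resetting the seed continues the stream.
third_draw <- sample(100, 10)

print(identical(first_draw, second_draw))
print(identical(second_draw, third_draw))
```

This is one reason a chunk can give different results when re-run on its own: unless the seed is reset immediately before the random step, the outcome depends on how many random draws came before it.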

2.5.3 Modeling Log Class 4 Error Counts w/ Conditional Tree

In order to model the log error counts with a conditional tree, the same prediction equation used in the linear model is implemented. As such, all variables are considered as predictors, excluding those used for identification and those variables which describe the error counts themselves. This predictive equation is saved as “lnC4.pred”, and is displayed below. A similar equation will be used in the analysis of each error class.

index <- createDataPartition(df1$lnClass.4, p=0.75, list=FALSE)
trainSet <- df1[ index,]
testSet <- df1[-index,]

predictors<- colnames(trainSet)
predictors <- predictors[!predictors %in% c("CoilID", "MAT_IDENT", "lTileID", "lnClass.4", "Class.4", "Class.14", "Class.15", "lnClass.14", "lnClass.15")]
lnC4.pred <- formula(paste("lnClass.4 ~ ", paste(predictors, collapse= " + ")))
lnC4.pred
## lnClass.4 ~ POSITION_X + VORG_HAUPTAGGREGAT + TO_FS_SSL + TO_FS_M + 
##     TO_FS_SSR + TO_SSR_FS + TO_SSR_LS + TO_LS_SSR + TO_LS_M + 
##     TO_LS_SSL + TO_SSL_LS + TO_SSL_FS + TM_FS_SSL + TM_FS_M + 
##     TM_FS_SSR + TM_SSR_FS + TM_SSR_LS + TM_LS_SSR + TM_LS_M + 
##     TM_LS_SSL + TM_SSL_LS + TM_SSL_FS + TU_FS_SSL + TU_FS_M + 
##     TU_FS_SSR + TU_SSR_FS + TU_SSR_LS + TU_LS_SSR + TU_LS_M + 
##     TU_LS_SSL + TU_SSL_LS + TU_SSL_FS + DT_SSR + DT_SSL + DT_FS + 
##     DT_LS + VG + FUELLSTAND + STRANGBREITE + WASSER_SSR + WASSER_SSL + 
##     WASSER_FS + WASSER_LS + STOPFENSTELLUNG + PLATTENDICKE_SSL + 
##     PLATTENDICKE_SSR + ARGON_DRUCK_ST + ARGON_DURCHFL_ST + ARGON_DURCHFL_DUSCH + 
##     TUNDISH_POSITION + VERTEILERFUELLSTAND + NETTO_PFANNENINHALT + 
##     KONI_LINKS + ANST__VS_HG_3__IR__S + DICKE__AL__IR__S + DICKE__HA_1__IR__S + 
##     DICKE__HA_2__IR__S + DICKE__VB__IR__S + ENTZ__FS_ZW_F1__IR__S + 
##     ENTZ__FS_ZW_F2__IR__S + ENTZ__ZWR1_AL_SN2__IR__S + ENTZ__ZW_OF_AL__IR__S + 
##     KEIL25__FB__IR__S + KEIL40__FB__IR__S + PR_40__FB__IR__S + 
##     RISS__HA_AS__IR__S + RISS__HA_BS__IR__S + TEMP__FB_1__IR__S + 
##     TEMP__FB_3__IR__S + TEMP__HA_1__IR__S + TEMP__HA_4__IR__S + 
##     TEMP__HA_5__IR__S + TEMP__HA__SR__MAX + V__FS_G2__IR__S + 
##     V__FS_G3__IR__S + V__FS_G4__IR__S + V__FS_G5__IR__S + V__FS_G6__IR__S + 
##     V__FS_G7__IR__S + WK__FS_G7__IR__S + WK__VS_SP_3__IR__S + 
##     WSPALT__FS_G7__IR__S
output.tree <- partykit::ctree(lnC4.pred, data = trainSet)
png("anna_tks_tree10.png", res=80, height=800, width=1600)
plot(output.tree)
dev.off()
## png 
##   2
print(output.tree)
## 
## Model formula:
## lnClass.4 ~ POSITION_X + VORG_HAUPTAGGREGAT + TO_FS_SSL + TO_FS_M + 
##     TO_FS_SSR + TO_SSR_FS + TO_SSR_LS + TO_LS_SSR + TO_LS_M + 
##     TO_LS_SSL + TO_SSL_LS + TO_SSL_FS + TM_FS_SSL + TM_FS_M + 
##     TM_FS_SSR + TM_SSR_FS + TM_SSR_LS + TM_LS_SSR + TM_LS_M + 
##     TM_LS_SSL + TM_SSL_LS + TM_SSL_FS + TU_FS_SSL + TU_FS_M + 
##     TU_FS_SSR + TU_SSR_FS + TU_SSR_LS + TU_LS_SSR + TU_LS_M + 
##     TU_LS_SSL + TU_SSL_LS + TU_SSL_FS + DT_SSR + DT_SSL + DT_FS + 
##     DT_LS + VG + FUELLSTAND + STRANGBREITE + WASSER_SSR + WASSER_SSL + 
##     WASSER_FS + WASSER_LS + STOPFENSTELLUNG + PLATTENDICKE_SSL + 
##     PLATTENDICKE_SSR + ARGON_DRUCK_ST + ARGON_DURCHFL_ST + ARGON_DURCHFL_DUSCH + 
##     TUNDISH_POSITION + VERTEILERFUELLSTAND + NETTO_PFANNENINHALT + 
##     KONI_LINKS + ANST__VS_HG_3__IR__S + DICKE__AL__IR__S + DICKE__HA_1__IR__S + 
##     DICKE__HA_2__IR__S + DICKE__VB__IR__S + ENTZ__FS_ZW_F1__IR__S + 
##     ENTZ__FS_ZW_F2__IR__S + ENTZ__ZWR1_AL_SN2__IR__S + ENTZ__ZW_OF_AL__IR__S + 
##     KEIL25__FB__IR__S + KEIL40__FB__IR__S + PR_40__FB__IR__S + 
##     RISS__HA_AS__IR__S + RISS__HA_BS__IR__S + TEMP__FB_1__IR__S + 
##     TEMP__FB_3__IR__S + TEMP__HA_1__IR__S + TEMP__HA_4__IR__S + 
##     TEMP__HA_5__IR__S + TEMP__HA__SR__MAX + V__FS_G2__IR__S + 
##     V__FS_G3__IR__S + V__FS_G4__IR__S + V__FS_G5__IR__S + V__FS_G6__IR__S + 
##     V__FS_G7__IR__S + WK__FS_G7__IR__S + WK__VS_SP_3__IR__S + 
##     WSPALT__FS_G7__IR__S
## 
## Fitted party:
## [1] root
## |   [2] VG <= 0.9956
## |   |   [3] ANST__VS_HG_3__IR__S <= 0.94633
## |   |   |   [4] V__FS_G5__IR__S <= 17.89324: 0.203 (n = 6944, err = 1397.3)
## |   |   |   [5] V__FS_G5__IR__S > 17.89324
## |   |   |   |   [6] TM_LS_SSR <= 134.1
## |   |   |   |   |   [7] POSITION_X <= 75.99812: 0.418 (n = 94, err = 36.3)
## |   |   |   |   |   [8] POSITION_X > 75.99812
## |   |   |   |   |   |   [9] KEIL40__FB__IR__S <= -0.75525: 0.176 (n = 865, err = 140.0)
## |   |   |   |   |   |   [10] KEIL40__FB__IR__S > -0.75525
## |   |   |   |   |   |   |   [11] POSITION_X <= 509.71317: 0.279 (n = 2044, err = 520.3)
## |   |   |   |   |   |   |   [12] POSITION_X > 509.71317: 0.207 (n = 1211, err = 225.4)
## |   |   |   |   [13] TM_LS_SSR > 134.1: 0.184 (n = 2167, err = 365.2)
## |   |   [14] ANST__VS_HG_3__IR__S > 0.94633: 0.303 (n = 426, err = 120.4)
## |   [15] VG > 0.9956
## |   |   [16] TO_SSL_FS <= 232.65
## |   |   |   [17] TEMP__HA_1__IR__S <= 27.29227
## |   |   |   |   [18] ENTZ__FS_ZW_F1__IR__S <= 0.04688: 0.240 (n = 2294, err = 495.0)
## |   |   |   |   [19] ENTZ__FS_ZW_F1__IR__S > 0.04688: 0.463 (n = 82, err = 28.1)
## |   |   |   [20] TEMP__HA_1__IR__S > 27.29227: 0.273 (n = 3325, err = 776.5)
## |   |   [21] TO_SSL_FS > 232.65: 0.454 (n = 710, err = 239.3)
## 
## Number of inner nodes:    10
## Number of terminal nodes: 11
plot(output.tree, 
     main = "Log Class 4 Error Counts Tree",
     gp = gpar(fontsize = 10),
     inner_panel=node_inner,
     ip_args=list(abbreviate = FALSE, id = FALSE)
     )

Above one can see both the R output describing the conditional tree and the tree plot. According to the conditional tree printed above, the variables selected for splits when predicting log(Class 4) errors are:

  • VG
  • ANST__VS_HG_3__IR__S
  • V__FS_G5__IR__S
  • TM_LS_SSR
  • POSITION_X
  • KEIL40__FB__IR__S
  • TO_SSL_FS
  • TEMP__HA_1__IR__S
  • ENTZ__FS_ZW_F1__IR__S

These are listed without repetition, although within the tree POSITION_X is used for two separate splits. With regard to the plotted tree, the terminal nodes show a box plot of the observations in each node. For the nodes which do not present an obvious “box”, the implication is that there are so many 0 observations in the node that the interquartile range has compressed around 0. As such, those nodes with an observable IQR in the box plot contain more observations away from 0, i.e. errors. In the printed output, the terminal nodes with the highest mean log error counts are nodes 19 (0.463), 21 (0.454), 7 (0.418), and 14 (0.303). The splits which lead to these terminal nodes are:

  • VG (> 0.9956 for nodes 19 and 21, <= 0.9956 for nodes 7 and 14)
  • TO_SSL_FS (<= 232.65 for node 19, > 232.65 for node 21)
  • TEMP__HA_1__IR__S (<= 27.29227 for node 19)
  • ENTZ__FS_ZW_F1__IR__S (> 0.04688 for node 19)
  • ANST__VS_HG_3__IR__S (<= 0.94633 for node 7, > 0.94633 for node 14)
  • V__FS_G5__IR__S (> 17.89324 for node 7)
  • TM_LS_SSR (<= 134.1 for node 7)
  • POSITION_X (<= 75.99812 for node 7)

2.5.4 Modeling Class 4 Error Counts w/ Conditional Tree - Binary Class 4

In the above tree, the outcome variable, log(Class 4) errors, has multiple observable values. These different values most likely refer to slight variations in observed class 4 errors, or to changes in severity. As the goal of this analysis is to find variables linked to the presence of any error, regardless of severity, size, or other specifying qualities, it may be better to create a binary interpretation of the Class 4 error variable. This will allow the model to predict for the pure error rate, rather than forcing it to account for multiple levels of error. Below, a new variable is created from the original class 4 error data, providing a binary interpretation of the error rates.

df1$C4 <- with(df1, Class.4>0)
df1$C4<-factor(df1$C4, levels=c(FALSE,TRUE), labels=c("no.error", "error"))

prop.table(table(df1$C4))
## 
##  no.error     error 
## 0.7651588 0.2348412

From the above proportion table we can see that, under the binary interpretation, our data splits into 76.52% “no Class 4 error” and 23.48% “Class 4 error”.

index <- createDataPartition(df1$C4, p=0.75, list=FALSE)

trainSet <- df1[ index,]
testSet <- df1[-index,]

C4.pred <- formula(paste("C4 ~ ", paste(predictors, collapse= " + ")))
output.tree <- partykit::ctree(C4.pred, data = trainSet)
png("anna_tks_tree20.png", res=80, height=800, width=1600)
plot(output.tree)
dev.off()
## png 
##   2
print(output.tree)
## 
## Model formula:
## C4 ~ POSITION_X + VORG_HAUPTAGGREGAT + TO_FS_SSL + TO_FS_M + 
##     TO_FS_SSR + TO_SSR_FS + TO_SSR_LS + TO_LS_SSR + TO_LS_M + 
##     TO_LS_SSL + TO_SSL_LS + TO_SSL_FS + TM_FS_SSL + TM_FS_M + 
##     TM_FS_SSR + TM_SSR_FS + TM_SSR_LS + TM_LS_SSR + TM_LS_M + 
##     TM_LS_SSL + TM_SSL_LS + TM_SSL_FS + TU_FS_SSL + TU_FS_M + 
##     TU_FS_SSR + TU_SSR_FS + TU_SSR_LS + TU_LS_SSR + TU_LS_M + 
##     TU_LS_SSL + TU_SSL_LS + TU_SSL_FS + DT_SSR + DT_SSL + DT_FS + 
##     DT_LS + VG + FUELLSTAND + STRANGBREITE + WASSER_SSR + WASSER_SSL + 
##     WASSER_FS + WASSER_LS + STOPFENSTELLUNG + PLATTENDICKE_SSL + 
##     PLATTENDICKE_SSR + ARGON_DRUCK_ST + ARGON_DURCHFL_ST + ARGON_DURCHFL_DUSCH + 
##     TUNDISH_POSITION + VERTEILERFUELLSTAND + NETTO_PFANNENINHALT + 
##     KONI_LINKS + ANST__VS_HG_3__IR__S + DICKE__AL__IR__S + DICKE__HA_1__IR__S + 
##     DICKE__HA_2__IR__S + DICKE__VB__IR__S + ENTZ__FS_ZW_F1__IR__S + 
##     ENTZ__FS_ZW_F2__IR__S + ENTZ__ZWR1_AL_SN2__IR__S + ENTZ__ZW_OF_AL__IR__S + 
##     KEIL25__FB__IR__S + KEIL40__FB__IR__S + PR_40__FB__IR__S + 
##     RISS__HA_AS__IR__S + RISS__HA_BS__IR__S + TEMP__FB_1__IR__S + 
##     TEMP__FB_3__IR__S + TEMP__HA_1__IR__S + TEMP__HA_4__IR__S + 
##     TEMP__HA_5__IR__S + TEMP__HA__SR__MAX + V__FS_G2__IR__S + 
##     V__FS_G3__IR__S + V__FS_G4__IR__S + V__FS_G5__IR__S + V__FS_G6__IR__S + 
##     V__FS_G7__IR__S + WK__FS_G7__IR__S + WK__VS_SP_3__IR__S + 
##     WSPALT__FS_G7__IR__S
## 
## Fitted party:
## [1] root
## |   [2] VG <= 0.982
## |   |   [3] ENTZ__FS_ZW_F1__IR__S <= 0.02941
## |   |   |   [4] WASSER_SSR <= 0.2632: no.error (n = 9064, err = 19.3%)
## |   |   |   [5] WASSER_SSR > 0.2632: no.error (n = 3673, err = 23.7%)
## |   |   [6] ENTZ__FS_ZW_F1__IR__S > 0.02941
## |   |   |   [7] POSITION_X <= 79.28875: no.error (n = 386, err = 33.4%)
## |   |   |   [8] POSITION_X > 79.28875: no.error (n = 549, err = 22.2%)
## |   [9] VG > 0.982
## |   |   [10] TU_LS_M <= 115.35: no.error (n = 4275, err = 26.3%)
## |   |   [11] TU_LS_M > 115.35
## |   |   |   [12] VG <= 1.138
## |   |   |   |   [13] TM_LS_SSL <= 136.15
## |   |   |   |   |   [14] DICKE__AL__IR__S <= 0.17795: no.error (n = 419, err = 37.5%)
## |   |   |   |   |   [15] DICKE__AL__IR__S > 0.17795
## |   |   |   |   |   |   [16] TO_SSL_FS <= 210.8: no.error (n = 61, err = 11.5%)
## |   |   |   |   |   |   [17] TO_SSL_FS > 210.8: no.error (n = 171, err = 33.3%)
## |   |   |   |   [18] TM_LS_SSL > 136.15: no.error (n = 673, err = 24.2%)
## |   |   |   [19] VG > 1.138: no.error (n = 891, err = 39.3%)
## 
## Number of inner nodes:     9
## Number of terminal nodes: 10
plot(output.tree,
     main = "Binary Class 4 Error Counts Conditional Tree",
     gp = gpar(fontsize = 10),
     inner_panel=node_inner,
     ip_args=list(abbreviate = FALSE, id = FALSE)
     )

With a binary interpretation of Class 4 errors, the printed tree output and the plotted tree can be seen above. Where before the terminal panels showed a box plot of the data in each node, here instead is a bar chart showing the proportion of errors vs non-errors in each terminal node. Every terminal node predicts no.error, so the err value printed for a node is the proportion of error observations it contains. In the printed output, the nodes with the greatest proportion of errors are nodes 19 (39.3% errors), 14 (37.5% errors), 7 (33.4% errors), and 17 (33.3% errors). Reviewing the inner nodes, one can see that the variables selected for splits, listed without repetition, are:

  • VG
  • ENTZ__FS_ZW_F1__IR__S
  • WASSER_SSR
  • POSITION_X
  • TU_LS_M
  • TM_LS_SSL
  • DICKE__AL__IR__S
  • TO_SSL_FS

Although not among the variables marked as of highest significance in the linear model, VG has now appeared as the root split for both the binary tree and the tree describing the log(Class 4) error counts. However, having run these code chunks multiple times, it has been noted that the structure of the above binary tree varies widely between runs. As such, one progresses to the use of a random forest.

2.5.5 Modeling Class 4 w/ Conditional Random Forest

In the initial run of this analysis, it was noted that there was a large amount of variation between runs, and that, without a stable seed, the results from independent runs were practically non-comparable. In part, the large amount of variability seen from model to model is due to the default construction of the cforest function.

The standard random forest model considers a default number of variables at each split in each tree, selecting this number of variables at random from the total available. The number of variables considered at each split can be adjusted through the mtry parameter, such that it can be increased for data sets with large numbers of variables. A tuning function for finding the best value of the mtry parameter is also available in the randomForest package. In the cforest function this parameter defaults to 5. It may be changed using the cforest_control function, but the party package does not come with a tuning function for this parameter. As such, the parameter would need to be adjusted by trial and error, which is why, for this analysis, only the number of trees was increased and the mtry parameter was left at its default level.

Given that the data frame used in the code chunk below to produce the conditional random forest contains 94 variables, many more trees are needed to consider all possible split options when mtry is set to only 5. As such, the main alteration made to this analysis from that of Prof. Wilhelm is that the number of trees in the model below, and in all following cforest models, has been increased to 1000. Tree counts of both 500 and 800 were also tested, but the variability between output models was not reduced enough to allow a review for significant variables. When the model was grown with a larger number of trees in the forest, a notable pattern of significant variables began to arise.
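The arithmetic behind this reasoning can be sketched in a few lines. The helper functions below are hypothetical and not part of party or randomForest. The value of 82 predictors is a hand count of the terms in the formula printed earlier and should be treated as approximate. The sketch assumes that the mtry candidates are drawn at random without replacement, independently at each split.

```r
# Probability that one specific predictor is among the mtry candidates
# drawn (without replacement) at a single split.
candidate_prob <- function(n_predictors, mtry) {
  mtry / n_predictors
}

# Probability that the predictor is never offered as a candidate across
# n_splits independent candidate draws.
never_offered_prob <- function(n_predictors, mtry, n_splits) {
  (1 - mtry / n_predictors)^n_splits
}

n_predictors <- 82  # assumption: hand count of the formula terms
mtry <- 5           # default used in this analysis

print(candidate_prob(n_predictors, mtry))
print(never_offered_prob(n_predictors, mtry, n_splits = 10))
print(never_offered_prob(n_predictors, mtry, n_splits = 100))
```

Because any one predictor is offered at only a small fraction of splits, a forest with few trees gives each variable few chances to show its effect, which is consistent with importance rankings becoming more stable as the tree count grows.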

dev.off()
## null device 
##           1
trainSet2 <- trainSet[sample(1:nrow(trainSet), 10000,
    replace=FALSE),]
output.forest <- party::cforest(C4.pred, data = trainSet2, control = cforest_control(ntree =1000))
var.imp.c4 <- party::varimp(output.forest)
png("anna_tks_tree20varimp.png", res=80, height=800, width=1600)
barchart(tail(sort(var.imp.c4), 40), xlab="Variable Importance", main="Variable Importance for Class 4 Errors")
var.imp.c4.1 <- var.imp.c4

Rather than examining the structure of any trees in the forest, displayed above is the variable importance for the Class 4 random forest. This variable importance chart will be used to review for the most significant variables in the forest. All variable importances computed in this analysis follow the permutation principle of the “mean decrease in accuracy”: the more the mean accuracy of the random forest decreases when a variable is removed or permuted, the more important that variable is deemed to be. One sees immediately that the variable deemed most significant in classifying Class 4 errors is, again, VG. However, as the trees from before suffered from large amounts of variation, the random forest model above is tested below for the same weakness.
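The permutation principle can be illustrated with a small base R sketch. This is not how party::varimp is implemented (it works tree by tree on out-of-bag observations); here a logistic regression on synthetic data stands in for the forest, and the data, the function names and the number of permutations are all assumptions made for the example.

```r
set.seed(1)

# Synthetic data: x1 drives the outcome, x2 is pure noise.
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
is_error <- rbinom(n, 1, plogis(2 * toy$x1)) == 1
toy$y <- factor(ifelse(is_error, "error", "no.error"),
                levels = c("no.error", "error"))

# Stand-in classifier (a logistic regression, not a forest).
fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

# Share of observations classified correctly at a 0.5 cut-off.
accuracy <- function(model, data) {
  prob_error <- predict(model, newdata = data, type = "response")
  pred <- ifelse(prob_error > 0.5, "error", "no.error")
  mean(pred == data$y)
}

# Mean decrease in accuracy after shuffling one variable at a time.
perm_importance <- function(model, data, vars, n_perm = 20) {
  base_acc <- accuracy(model, data)
  sapply(vars, function(v) {
    drops <- replicate(n_perm, {
      shuffled <- data
      shuffled[[v]] <- sample(shuffled[[v]])
      base_acc - accuracy(model, shuffled)
    })
    mean(drops)
  })
}

imp <- perm_importance(fit, toy, c("x1", "x2"))
print(imp)
```

Shuffling the informative variable breaks its link to the outcome and accuracy falls, while shuffling the noise variable changes accuracy very little, so its importance stays near zero.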

output.forest <- party::cforest(C4.pred, data = trainSet2, control = cforest_control(ntree =1000))
var.imp.c4 <- party::varimp(output.forest)
png("anna_tks_tree20varimp.2.png", res=80, height=800, width=1600)
barchart(tail(sort(var.imp.c4), 40), xlab="Variable Importance", main="Variable Importance for Class 4 Errors")
var.imp.c4.2 <- var.imp.c4

output.forest <- party::cforest(C4.pred, data = trainSet2, control = cforest_control(ntree =1000))
var.imp.c4 <- party::varimp(output.forest)
png("anna_tks_tree20varimp.3.png", res=80, height=800, width=1600)
barchart(tail(sort(var.imp.c4), 40), xlab="Variable Importance", main="Variable Importance for Class 4 Errors")
var.imp.c4.3 <- var.imp.c4

The above code chunk shows the creation of two additional random forest models, grown from the same training split as the first, and their corresponding variable importance calculations. Among the three iterations of the model, those variables which appeared within the top 20 most important variables for all three models are:

##  [1] "TM_FS_SSL"       "DT_FS"           "TU_FS_M"         "DT_SSL"         
##  [5] "TM_LS_SSR"       "TM_LS_SSL"       "V__FS_G4__IR__S" "TO_SSL_LS"      
##  [9] "TO_LS_SSL"       "TM_SSR_FS"       "KONI_LINKS"      "STRANGBREITE"   
## [13] "TO_SSL_FS"       "WASSER_SSR"      "WASSER_SSL"      "VG"

Note that these variables are listed in order of increasing average importance across all three models. As such, VG is the most significant on average, followed by WASSER_SSL, WASSER_SSR, and so on. In all three models, the first of which was used to generate the barchart above, the most important variable for accurate classification was VG. Variable importance barcharts for all three models, and all future models, have been saved with the knitting of this document.
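The code that produced the shared-variable lists is not shown in this document. One way such a list could be computed is sketched below. The helper shared_top_k and the toy importance values are hypothetical; in this analysis the inputs would be the saved vectors var.imp.c4.1, var.imp.c4.2 and var.imp.c4.3.

```r
# Variables that rank in the top k of every importance vector, returned
# in order of increasing average importance across the vectors.
shared_top_k <- function(imp_list, k = 20) {
  top_names <- lapply(imp_list, function(imp) names(tail(sort(imp), k)))
  shared <- Reduce(intersect, top_names)
  if (length(shared) == 0) return(character(0))
  avg_imp <- vapply(shared, function(v) {
    mean(vapply(imp_list, function(imp) imp[[v]], numeric(1)))
  }, numeric(1))
  names(sort(avg_imp))
}

# Toy importance vectors with invented values, for illustration only.
run1 <- c(VG = 0.9, WASSER_SSL = 0.5, DT_FS = 0.1, KONI_LINKS = 0.3)
run2 <- c(VG = 0.8, WASSER_SSL = 0.2, DT_FS = 0.6, KONI_LINKS = 0.1)
run3 <- c(VG = 0.7, WASSER_SSL = 0.4, DT_FS = 0.05, KONI_LINKS = 0.3)

shared <- shared_top_k(list(run1, run2, run3), k = 3)
print(shared)
```

The same helper could be applied to the shared lists of the three error classes to find variables that are important for all of them.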

When considering the top 40 variables of each model, the same number shown in each barchart, the following variables were shared by all three models.

##  [1] "TU_FS_SSR"        "TM_LS_M"          "TO_FS_SSL"        "POSITION_X"      
##  [5] "TM_FS_SSR"        "PLATTENDICKE_SSL" "PLATTENDICKE_SSR" "DT_LS"           
##  [9] "TU_SSL_FS"        "TM_SSL_FS"        "TM_FS_SSL"        "TO_SSR_FS"       
## [13] "DT_FS"            "TU_FS_M"          "DT_SSL"           "TM_LS_SSR"       
## [17] "TM_LS_SSL"        "TO_FS_SSR"        "TM_SSR_LS"        "V__FS_G4__IR__S" 
## [21] "TO_SSL_LS"        "TO_LS_SSL"        "TM_SSR_FS"        "KONI_LINKS"      
## [25] "STRANGBREITE"     "TO_SSL_FS"        "WASSER_SSR"       "WASSER_SSL"      
## [29] "VG"

Again, these variables are listed in order of increasing average importance for all three models.

df1 <- df1 %>%
  dplyr::select(-C4)

2.6 Class 14 Errors

In all remaining models, the use of log error counts is eschewed in favor of the binary transformation.

2.6.1 Modeling Class 14 Error Counts w/ Conditional Tree

df1$C14 <- with(df1, Class.14>0)
df1$C14<-factor(df1$C14, levels=c(FALSE,TRUE), labels=c("no.error", "error"))

prop.table(table(df1$C14))
## 
##    no.error       error 
## 0.991890484 0.008109516

From the above proportion table, one can see that our data is 99.19% error free, and contains only 0.81% error observations.

index <- createDataPartition(df1$C14, p=0.75, list=FALSE)
trainSet <- df1[ index,]
testSet <- df1[-index,]


predictors<- colnames(trainSet)
predictors <- predictors[!predictors %in% c("CoilID", "MAT_IDENT", "lTileID", "lnClass.4", "Class.4", "Class.14", "Class.15", "lnClass.14", "lnClass.15", "C4", "C14")]
C14.pred <- formula(paste("C14 ~ ", paste(predictors, collapse= " + ")))
output.tree.c14 <- partykit::ctree(C14.pred, data = trainSet)
png("anna_tks_tree_c14_2.png", res=80, height=800, width=1600) 
plot(output.tree.c14)
dev.off()
## png 
##   2
print(output.tree.c14)
## 
## Model formula:
## C14 ~ POSITION_X + VORG_HAUPTAGGREGAT + TO_FS_SSL + TO_FS_M + 
##     TO_FS_SSR + TO_SSR_FS + TO_SSR_LS + TO_LS_SSR + TO_LS_M + 
##     TO_LS_SSL + TO_SSL_LS + TO_SSL_FS + TM_FS_SSL + TM_FS_M + 
##     TM_FS_SSR + TM_SSR_FS + TM_SSR_LS + TM_LS_SSR + TM_LS_M + 
##     TM_LS_SSL + TM_SSL_LS + TM_SSL_FS + TU_FS_SSL + TU_FS_M + 
##     TU_FS_SSR + TU_SSR_FS + TU_SSR_LS + TU_LS_SSR + TU_LS_M + 
##     TU_LS_SSL + TU_SSL_LS + TU_SSL_FS + DT_SSR + DT_SSL + DT_FS + 
##     DT_LS + VG + FUELLSTAND + STRANGBREITE + WASSER_SSR + WASSER_SSL + 
##     WASSER_FS + WASSER_LS + STOPFENSTELLUNG + PLATTENDICKE_SSL + 
##     PLATTENDICKE_SSR + ARGON_DRUCK_ST + ARGON_DURCHFL_ST + ARGON_DURCHFL_DUSCH + 
##     TUNDISH_POSITION + VERTEILERFUELLSTAND + NETTO_PFANNENINHALT + 
##     KONI_LINKS + ANST__VS_HG_3__IR__S + DICKE__AL__IR__S + DICKE__HA_1__IR__S + 
##     DICKE__HA_2__IR__S + DICKE__VB__IR__S + ENTZ__FS_ZW_F1__IR__S + 
##     ENTZ__FS_ZW_F2__IR__S + ENTZ__ZWR1_AL_SN2__IR__S + ENTZ__ZW_OF_AL__IR__S + 
##     KEIL25__FB__IR__S + KEIL40__FB__IR__S + PR_40__FB__IR__S + 
##     RISS__HA_AS__IR__S + RISS__HA_BS__IR__S + TEMP__FB_1__IR__S + 
##     TEMP__FB_3__IR__S + TEMP__HA_1__IR__S + TEMP__HA_4__IR__S + 
##     TEMP__HA_5__IR__S + TEMP__HA__SR__MAX + V__FS_G2__IR__S + 
##     V__FS_G3__IR__S + V__FS_G4__IR__S + V__FS_G5__IR__S + V__FS_G6__IR__S + 
##     V__FS_G7__IR__S + WK__FS_G7__IR__S + WK__VS_SP_3__IR__S + 
##     WSPALT__FS_G7__IR__S
## 
## Fitted party:
## [1] root: no.error (n = 20162, err = 0.8%) 
## 
## Number of inner nodes:    0
## Number of terminal nodes: 1
plot(output.tree.c14, 
     main = "Binary Class 14 Error Counts Conditional Tree",
     gp = gpar(fontsize = 10),
     inner_panel=node_inner,
     ip_args=list(abbreviate = FALSE, id = FALSE)
     )

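The tree printed above made no splits: it consists of the root node alone, which predicts no.error for every observation and so misclassifies the 0.8% of observations that are errors. Under a different seed, the tree selected 5 different variables for its nodes. These are: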

  • WASSER_SSL
  • TU_FS_M
  • TM_FS_SSR
  • VERTEILERFUELLSTAND
  • TU_SSL_FS

Although these variables are selected as significant, as with previous tree models in this analysis, the structure is highly volatile under different seeds. This may in part be due to the incredibly low rate of error observations in the data, paired with the relatively large number of variables. To cope with the notable variation between tree structures, the below conditional random forest is grown, again with the number of trees to be grown set to 1000.
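With error rates this low, a simple random subsample of the training rows can contain noticeably more or fewer error observations from one draw to the next. One alternative, which is not used in this analysis, is to subsample within each class so that the class proportions are preserved. The sketch below is a base R illustration under that assumption; the function name and the toy data are hypothetical.

```r
# Draw roughly n_total rows while keeping the class proportions of
# `outcome`; every class keeps at least one row.
stratified_subsample <- function(data, outcome, n_total) {
  frac <- n_total / nrow(data)
  idx_by_class <- split(seq_len(nrow(data)), data[[outcome]])
  keep <- unlist(lapply(idx_by_class, function(idx) {
    n_keep <- max(1, round(frac * length(idx)))
    idx[sample.int(length(idx), size = n_keep)]
  }), use.names = FALSE)
  data[keep, , drop = FALSE]
}

# Toy data with a rare "error" class (8 of 1000 rows).
set.seed(1)
toy <- data.frame(
  C14 = factor(rep(c("no.error", "error"), times = c(992, 8)),
               levels = c("no.error", "error")),
  x = rnorm(1000)
)

sub <- stratified_subsample(toy, "C14", n_total = 500)
print(table(sub$C14))
```

Because each class is rounded separately, the result can differ from n_total by a row or two; for this analysis that would be negligible next to the 10,000 rows drawn.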

2.6.2 Modeling Class 14 Errors w/ Conditional Random Forest

trainSet2 <- trainSet[sample(1:nrow(trainSet), 10000,
    replace=FALSE),]
output.rf.c14 <- party::cforest(C14.pred, data = trainSet2, control = cforest_control(ntree =1000))
var.imp.c14 <- party::varimp(output.rf.c14)
png("anna_tks_rfc14varimp.1.png", res=80, height=800, width=1600) 
barchart(tail(sort(var.imp.c14), 40), xlab="Variable Importance", main="Variable Importance Class 14 Error")
var.imp.c14.1 <- var.imp.c14

The above bar chart was constructed using the variable importance measures for the first of three conditional random forests generated on the training split for the Class 14 errors. Just as with the Class 4 errors, three separate random forest models were grown for the final analysis, in order to compare their results for variability.

output.rf.c14 <- party::cforest(C14.pred, data = trainSet2, control = cforest_control(ntree =1000))
var.imp.c14 <- party::varimp(output.rf.c14)
png("anna_tks_rfc14varimp.2.png", res=80, height=800, width=1600) 
barchart(tail(sort(var.imp.c14), 40), xlab="Variable Importance", main="Variable Importance Class 14 Error")
var.imp.c14.2 <- var.imp.c14

output.rf.c14 <- party::cforest(C14.pred, data = trainSet2, control = cforest_control(ntree =1000))
var.imp.c14 <- party::varimp(output.rf.c14)
png("anna_tks_rfc14varimp.3.png", res=80, height=800, width=1600) 
barchart(tail(sort(var.imp.c14), 40), xlab="Variable Importance", main="Variable Importance Class 14 Error")
var.imp.c14.3 <- var.imp.c14

The above chunk shows the construction of the additional two random forest models, trained on the same training split as the first model, whose variable importance is presented above.

When comparing across all three models, the variables which ranked among the 20 most important in every model, with regard to mean decrease in accuracy, are:

## [1] "TM_LS_SSL"    "STRANGBREITE" "TU_FS_M"      "TO_LS_SSR"    "TM_LS_M"     
## [6] "WASSER_SSL"   "TO_LS_SSL"    "TU_LS_M"

Again, these variables are listed in order of increasing average importance across all three models.

When considering the top 40 most important variables for each model, as the barchart above shows for model 1, one finds the following variables to be shared by all three models.

##  [1] "PLATTENDICKE_SSL"    "TO_SSL_LS"           "TM_FS_M"            
##  [4] "TM_SSL_LS"           "TU_LS_SSR"           "TM_SSR_LS"          
##  [7] "TU_SSR_FS"           "DT_SSR"              "TM_LS_SSR"          
## [10] "TM_SSR_FS"           "TO_SSR_FS"           "TO_FS_SSR"          
## [13] "VERTEILERFUELLSTAND" "WASSER_SSR"          "ARGON_DRUCK_ST"     
## [16] "VG"                  "TM_LS_SSL"           "TU_SSL_LS"          
## [19] "STRANGBREITE"        "TU_FS_M"             "TM_SSL_FS"          
## [22] "TU_SSR_LS"           "KONI_LINKS"          "TO_LS_M"            
## [25] "TO_LS_SSR"           "DT_SSL"              "TM_LS_M"            
## [28] "TO_FS_SSL"           "WASSER_SSL"          "DT_FS"              
## [31] "DT_LS"               "TO_LS_SSL"           "TU_LS_M"
df1 <- df1 %>%
  dplyr::select(-C14)

2.7 Class 15 Errors

2.7.1 Modeling Class 15 Error Counts w/ Conditional Tree

df1$C15 <- with(df1, Class.15>0)
df1$C15<-factor(df1$C15, levels=c(FALSE,TRUE), labels=c("no.error", "error"))

prop.table(table(df1$C15))
## 
##    no.error       error 
## 0.992634477 0.007365523

As with the Class 14 errors, there is an extremely low observance rate of Class 15 errors in our data set. Only 0.74% of our observations are Class 15 errors, while 99.26% are error free.

index <- createDataPartition(df1$C15, p=0.75, list=FALSE)
trainSet <- df1[ index,]
testSet <- df1[-index,]

outcomeName<-'C15'
predictors <- predictors[!predictors %in% c("CoilID", "MAT_IDENT", "lTileID", "lnClass.4", "Class.4", "Class.14", "Class.15", "lnClass.14", "lnClass.15", "C4", "C14", "C15")]

C15.pred <- formula(paste("C15 ~ ", paste(predictors, collapse= " + ")))
output.tree.c15 <- partykit::ctree(C15.pred, data = trainSet)
png("anna_tks_tree_c15_2.png", res=80, height=800, width=1600) 
plot(output.tree.c15)
dev.off()
## png 
##   2
print(output.tree.c15)
## 
## Model formula:
## C15 ~ POSITION_X + VORG_HAUPTAGGREGAT + TO_FS_SSL + TO_FS_M + 
##     TO_FS_SSR + TO_SSR_FS + TO_SSR_LS + TO_LS_SSR + TO_LS_M + 
##     TO_LS_SSL + TO_SSL_LS + TO_SSL_FS + TM_FS_SSL + TM_FS_M + 
##     TM_FS_SSR + TM_SSR_FS + TM_SSR_LS + TM_LS_SSR + TM_LS_M + 
##     TM_LS_SSL + TM_SSL_LS + TM_SSL_FS + TU_FS_SSL + TU_FS_M + 
##     TU_FS_SSR + TU_SSR_FS + TU_SSR_LS + TU_LS_SSR + TU_LS_M + 
##     TU_LS_SSL + TU_SSL_LS + TU_SSL_FS + DT_SSR + DT_SSL + DT_FS + 
##     DT_LS + VG + FUELLSTAND + STRANGBREITE + WASSER_SSR + WASSER_SSL + 
##     WASSER_FS + WASSER_LS + STOPFENSTELLUNG + PLATTENDICKE_SSL + 
##     PLATTENDICKE_SSR + ARGON_DRUCK_ST + ARGON_DURCHFL_ST + ARGON_DURCHFL_DUSCH + 
##     TUNDISH_POSITION + VERTEILERFUELLSTAND + NETTO_PFANNENINHALT + 
##     KONI_LINKS + ANST__VS_HG_3__IR__S + DICKE__AL__IR__S + DICKE__HA_1__IR__S + 
##     DICKE__HA_2__IR__S + DICKE__VB__IR__S + ENTZ__FS_ZW_F1__IR__S + 
##     ENTZ__FS_ZW_F2__IR__S + ENTZ__ZWR1_AL_SN2__IR__S + ENTZ__ZW_OF_AL__IR__S + 
##     KEIL25__FB__IR__S + KEIL40__FB__IR__S + PR_40__FB__IR__S + 
##     RISS__HA_AS__IR__S + RISS__HA_BS__IR__S + TEMP__FB_1__IR__S + 
##     TEMP__FB_3__IR__S + TEMP__HA_1__IR__S + TEMP__HA_4__IR__S + 
##     TEMP__HA_5__IR__S + TEMP__HA__SR__MAX + V__FS_G2__IR__S + 
##     V__FS_G3__IR__S + V__FS_G4__IR__S + V__FS_G5__IR__S + V__FS_G6__IR__S + 
##     V__FS_G7__IR__S + WK__FS_G7__IR__S + WK__VS_SP_3__IR__S + 
##     WSPALT__FS_G7__IR__S
## 
## Fitted party:
## [1] root
## |   [2] DICKE__VB__IR__S <= 1.36037
## |   |   [3] STRANGBREITE <= 2530
## |   |   |   [4] VERTEILERFUELLSTAND <= 78.16667: no.error (n = 1934, err = 0.4%)
## |   |   |   [5] VERTEILERFUELLSTAND > 78.16667
## |   |   |   |   [6] TU_FS_M <= 117.73333: no.error (n = 2398, err = 1.5%)
## |   |   |   |   [7] TU_FS_M > 117.73333
## |   |   |   |   |   [8] TM_FS_M <= 117.3: no.error (n = 15, err = 33.3%)
## |   |   |   |   |   [9] TM_FS_M > 117.3: no.error (n = 197, err = 5.1%)
## |   |   [10] STRANGBREITE > 2530: no.error (n = 15263, err = 0.5%)
## |   [11] DICKE__VB__IR__S > 1.36037: no.error (n = 355, err = 3.1%)
## 
## Number of inner nodes:    5
## Number of terminal nodes: 6
plot(output.tree.c15, 
     main = "Binary Class 15 Error Counts Conditional Tree",
     gp = gpar(fontsize = 10),
     inner_panel=node_inner,
     ip_args=list(abbreviate = FALSE, id = FALSE)
     )

The above output shows the conditional tree grown for Class 15 errors. In the printed output, the node with the highest proportion of errors is node 8, in which 33.3% of the 15 observations are Class 15 errors. The variables chosen for inner nodes in this tree are listed as follows.

  • DICKE__VB__IR__S
  • STRANGBREITE
  • VERTEILERFUELLSTAND
  • TU_FS_M
  • TM_FS_M

Again, iteratively growing the tree showed notable amounts of variation in structure, so 3 random forest models, with tree count set to 1000, have been grown below.

2.7.2 Modeling Class 15 Errors w/ Conditional Random Forest

trainSet2 <- trainSet[sample(1:nrow(trainSet), 10000,
    replace=FALSE),]
output.rf.c15 <- party::cforest(C15.pred, data = trainSet2, control = cforest_control(ntree =1000))
var.imp.c15 <- party::varimp(output.rf.c15)
png("anna_tks_rfc15varimp.1.png", res=80, height=800, width=1600) 
barchart(tail(sort(var.imp.c15), 40), xlab="Variable Importance", main="Variable Importance Class 15 Error")
var.imp.c15.1 <- var.imp.c15

Above is the variable importance barchart for the first conditional random forest model for Class 15 errors. Immediately obvious is the fact that most variables chosen by the tree above do not feature among the 20 most important variables; STRANGBREITE is a notable exception. This again stresses the variation in the model at the tree level. In order to cope with this variability, two more random forest models are grown below, on the same training split as the model above, and their most important variables are compared.

# Forest 2 of 3: same training split, same settings
output.rf.c15 <- party::cforest(C15.pred, data = trainSet2, control = cforest_control(ntree = 1000))
var.imp.c15 <- party::varimp(output.rf.c15)
png("anna_tks_rfc15varimp.2.png", res = 80, height = 800, width = 1600)
barchart(tail(sort(var.imp.c15), 40), xlab = "Variable Importance", main = "Variable Importance Class 15 Error")
dev.off() # close the device so the PNG file is written
var.imp.c15.2 <- var.imp.c15

# Forest 3 of 3: same training split, same settings
output.rf.c15 <- party::cforest(C15.pred, data = trainSet2, control = cforest_control(ntree = 1000))
var.imp.c15 <- party::varimp(output.rf.c15)
png("anna_tks_rfc15varimp.3.png", res = 80, height = 800, width = 1600)
barchart(tail(sort(var.imp.c15), 40), xlab = "Variable Importance", main = "Variable Importance Class 15 Error")
dev.off() # close the device so the PNG file is written
var.imp.c15.3 <- var.imp.c15

The variables that ranked among the 20 most important in all three models are listed below.

## [1] "TO_SSR_LS"    "TM_FS_M"      "TU_SSL_LS"    "TO_SSL_FS"    "TM_FS_SSR"   
## [6] "KONI_LINKS"   "STRANGBREITE" "TO_FS_M"      "TU_FS_SSR"

The above variables are listed in order of increasing average importance across all three models. When considering the top 40 most important variables for all three models, we find that the variables listed below are shared by all three.

##  [1] "TM_LS_SSR"        "TO_FS_SSR"        "PLATTENDICKE_SSR" "TU_LS_SSL"       
##  [5] "TU_LS_SSR"        "TO_LS_SSL"        "DT_LS"            "TO_LS_SSR"       
##  [9] "TM_SSL_LS"        "TU_FS_M"          "DICKE__VB__IR__S" "TO_LS_M"         
## [13] "TU_SSR_LS"        "TO_SSR_LS"        "TU_SSL_FS"        "TM_FS_M"         
## [17] "TU_SSL_LS"        "TM_SSR_LS"        "TO_SSL_FS"        "TM_FS_SSR"       
## [21] "KONI_LINKS"       "STRANGBREITE"     "TO_FS_M"          "TU_FS_SSR"
df1 <- df1 %>%
  dplyr::select(-C15)
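
The code that produced the two shared-variable lists above is not shown in this report. The following is a minimal sketch of one way such lists could be computed, assuming each forest's importance scores are held in a named numeric vector such as the ones returned by party::varimp. The helper name shared_top_k and the toy importance values are hypothetical and serve only as illustration.

```r
# Hypothetical helper: names shared by the top-k variables of every model,
# returned in order of increasing average importance across the models.
shared_top_k <- function(imp_list, k) {
  # Names of the k largest importance values for each model
  top_names <- lapply(imp_list, function(imp) names(tail(sort(imp), k)))
  # Keep only the names that occur in every model's top-k set
  shared <- Reduce(intersect, top_names)
  # Average importance of each shared variable across all models
  avg_imp <- vapply(
    shared,
    function(v) mean(vapply(imp_list, function(imp) imp[[v]], numeric(1))),
    numeric(1)
  )
  names(sort(avg_imp))
}

# Toy importance vectors standing in for var.imp.c15.1, .2 and .3
imp1 <- c(A = 0.9, B = 0.5, C = 0.1, D = 0.3)
imp2 <- c(A = 0.4, B = 0.6, C = 0.1, D = 0.2)
imp3 <- c(A = 0.7, B = 0.3, C = 0.2, D = 0.1)

# D is in the top 3 of the first two models only, so it is dropped
shared_k3 <- shared_top_k(list(imp1, imp2, imp3), k = 3)
print(shared_k3)
```

With the real forests, a call such as shared_top_k(list(var.imp.c15.1, var.imp.c15.2, var.imp.c15.3), k = 20) would play the same role for the top-20 list.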

2.8 Comparing Significant Variables across Error Classes

As the errors all occur on the same sheets, it is possible that they at times occur simultaneously. With this in mind, it is useful to know which variables, if any, are significant for classifying all three error classes.

Variables which were ranked as being in the top 20 most important variables for all three error class models are listed as follows.

## [1] "STRANGBREITE"

Variables marked as one of the top 40 most important variables in the conditional random forest models of all three error classes are:

## [1] "DT_LS"        "TU_FS_M"      "TM_LS_SSR"    "TO_FS_SSR"    "TM_SSR_LS"   
## [6] "TO_LS_SSL"    "KONI_LINKS"   "STRANGBREITE"
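
As a hedged illustration of this comparison, the sketch below intersects the per-class top-20 lists reported in this document (C4Top20, C14Top20, and C15Top20). It is one way to obtain the shared set and is not necessarily the code used for the output above.

```r
# Per-class lists of variables shared in the top 20, copied from this report
C4Top20 <- c("TM_FS_SSL", "DT_FS", "TU_FS_M", "DT_SSL", "TM_LS_SSR",
             "TM_LS_SSL", "V__FS_G4__IR__S", "TO_SSL_LS", "TO_LS_SSL",
             "TM_SSR_FS", "KONI_LINKS", "STRANGBREITE", "TO_SSL_FS",
             "WASSER_SSR", "WASSER_SSL", "VG")
C14Top20 <- c("TM_LS_SSL", "STRANGBREITE", "TU_FS_M", "TO_LS_SSR",
              "TM_LS_M", "WASSER_SSL", "TO_LS_SSL", "TU_LS_M")
C15Top20 <- c("TO_SSR_LS", "TM_FS_M", "TU_SSL_LS", "TO_SSL_FS", "TM_FS_SSR",
              "KONI_LINKS", "STRANGBREITE", "TO_FS_M", "TU_FS_SSR")

# Variables present in the top-20 list of every error class
AllTop20 <- Reduce(intersect, list(C4Top20, C14Top20, C15Top20))
print(AllTop20)
```

Applying the same call to the three top-40 lists gives the second comparison.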

2.9 Conclusion

In the above analysis, data produced during the earlier data reduction and treatment (steps 1-4, available on the Nextcloud server) is treated for possible redundancies caused by highly correlated variables and reviewed for any interesting patterns. The data is then used to create tree and forest models for classifying the presence or absence of three separate error types: Classes 4, 14, and 15.

2.9.1 Class 4 Conclusion

First, Class 4 error counts were transformed using a log(data + 1) transformation. This was done to deal with the strong skew in the data or, in other words, with the very low occurrence rate of errors. Class 4 was the only class to show a notable change under the transformation, as this error type took multiple levels across the observations. When modeled with a linear regression model, 18 different variables were found to have a significant effect. However, as the model fit was very poor, these results were regarded as suspect. To obtain a better-fitting model for the data at hand, conditional tree and conditional random forest models were grown.
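
The effect of the log(data + 1) transformation can be sketched on toy data. The counts below are made up for illustration and are not taken from the steel sheet data. The second vector shows why a response with only two levels gains little from the transformation: it is merely rescaled.

```r
# Made-up, zero-heavy error counts with a long right tail
counts <- c(0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 15, 0, 0, 40)

# log1p(x) computes log(x + 1), so zero counts map to exactly 0
log_counts <- log1p(counts)

# The long right tail is compressed, while the zeros stay where they are
print(range(counts))
print(range(log_counts))

# A response with only two levels is merely rescaled by the transformation
two_level <- c(0, 0, 1, 0, 1)
print(unique(log1p(two_level)))
```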

For the first tree, the log transformation of the Class 4 error count was again used as the response variable of the model. Although the tree was highly volatile under different seeds, the variables marked as significant were:

  • Log(Class 4) Tree – Significant Variables
    • VG
    • ENTZ__FS_ZW_F1__IR__S
    • KEIL40__FB__IR__S
    • TM_LS_SSL
    • TM_FS_SSR
    • V__FS_G5__IR__S
    • TEMP__HA__SR__MAX
    • TM_SSL_LS

A binary transformation was then created for the Class 4 error variable, which simply marked the presence or absence of an error. This was then provided to a new conditional tree. The tree grown with a binary response generated the following significant variables.

  • Binary(Class 4) Tree – Significant Variables
    • VG
    • TM_LS_SSL
    • V__FS_G5__IR__S
    • POSITION_X
    • TEMP__HA_5__IR__S
    • TM_FS_SSR
    • STOPFENSTELLUNG
    • DICKE__VB__IR__S
    • WASSER_LS
    • TO_SSL_FS
    • ENTZ__FS_ZW_F1__IR__S
    • TUNDISH_POSITION
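
The binary transformation used as the response for this tree can be sketched as follows. The counts and the level labels are made up for illustration. Storing the indicator as a factor is an assumption here; a factor response leads tree functions such as ctree to treat the problem as classification.

```r
# Made-up error counts per observation
counts <- c(0, 0, 3, 0, 1, 0, 12)

# Mark the presence or absence of at least one error as a two-level factor
has_error <- factor(counts > 0,
                    levels = c(FALSE, TRUE),
                    labels = c("no_error", "error"))

print(table(has_error))
```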

Again, under different seeds this tree’s structure was found to be highly variable. It is worth noting that these two trees already share some variables of significance. Both models utilize the following variables at various inner nodes.

  • Class 4 Trees - Shared Significant Variables
    • VG
    • ENTZ__FS_ZW_F1__IR__S
    • V__FS_G5__IR__S
    • TM_FS_SSR
    • TM_LS_SSL

Finally, to deal with the variability in the tree structure, three conditional random forest models were created. The first version of this analysis grew forests with only the default 500 trees, but such models were still highly volatile under different seeds. The tree count in the forest was therefore increased first to 800, and then to the 1000 trees used in all conditional forest models above. The mtry parameter, which controls the number of variables considered at each split in each tree of the forest, was also considered for tuning. As the party package has no built-in function for tuning this parameter, it was left at the default of 5, to avoid the possibility of overfitting. All three forests were grown from the same training split, to allow for comparability of output.

Each conditional forest produced slightly different variable importance results, due to the inherent variability in the creation of the forest model. That being said, the three models produced above still marked a notable number of the same variables as being of comparable importance. Of the top 20 most important variables in each forest, all three models included the following variables. Note that these are listed in order of increasing average importance across all three models, such that the most important variable on average for the three models was VG.

C4Top20  
##  [1] "TM_FS_SSL"       "DT_FS"           "TU_FS_M"         "DT_SSL"         
##  [5] "TM_LS_SSR"       "TM_LS_SSL"       "V__FS_G4__IR__S" "TO_SSL_LS"      
##  [9] "TO_LS_SSL"       "TM_SSR_FS"       "KONI_LINKS"      "STRANGBREITE"   
## [13] "TO_SSL_FS"       "WASSER_SSR"      "WASSER_SSL"      "VG"

In this list we see only two of the variables which are shared by the two conditional trees: TM_LS_SSL and VG.

Of the top 40 variables marked as most important in each model, the three models shared the following.

C4Top40   
##  [1] "TU_FS_SSR"        "TM_LS_M"          "TO_FS_SSL"        "POSITION_X"      
##  [5] "TM_FS_SSR"        "PLATTENDICKE_SSL" "PLATTENDICKE_SSR" "DT_LS"           
##  [9] "TU_SSL_FS"        "TM_SSL_FS"        "TM_FS_SSL"        "TO_SSR_FS"       
## [13] "DT_FS"            "TU_FS_M"          "DT_SSL"           "TM_LS_SSR"       
## [17] "TM_LS_SSL"        "TO_FS_SSR"        "TM_SSR_LS"        "V__FS_G4__IR__S" 
## [21] "TO_SSL_LS"        "TO_LS_SSL"        "TM_SSR_FS"        "KONI_LINKS"      
## [25] "STRANGBREITE"     "TO_SSL_FS"        "WASSER_SSR"       "WASSER_SSL"      
## [29] "VG"

2.9.2 Class 14 Conclusion

As the log transformation of the Class 14 errors did not appear to have a useful effect on the data set, only the binary transformation was applied and modeled. The tree created for the binary Class 14 error variable was again highly volatile. The tree grown under the seed used here selected only one variable, DT_FS, as significant for classifying the response, and each resulting node contained less than 1% errors. Again, the tree count of the random forest model was increased to 1000, and three conditional random forests were grown in order to compare how volatile the results were.

Despite the expected variability in output, all three models still selected a notable group of shared variables which were marked as important in each individual model. They are listed as follows, in order of increasing average importance across all three models.

C14Top20
## [1] "TM_LS_SSL"    "STRANGBREITE" "TU_FS_M"      "TO_LS_SSR"    "TM_LS_M"     
## [6] "WASSER_SSL"   "TO_LS_SSL"    "TU_LS_M"

As one can see, the variable marked as significant in the original tree, DT_FS, is not among the variables shared in the top 20 of the three models. DT_FS is, however, among the variables shared within the top 40 most important variables, which are listed below.

C14Top40  
##  [1] "PLATTENDICKE_SSL"    "TO_SSL_LS"           "TM_FS_M"            
##  [4] "TM_SSL_LS"           "TU_LS_SSR"           "TM_SSR_LS"          
##  [7] "TU_SSR_FS"           "DT_SSR"              "TM_LS_SSR"          
## [10] "TM_SSR_FS"           "TO_SSR_FS"           "TO_FS_SSR"          
## [13] "VERTEILERFUELLSTAND" "WASSER_SSR"          "ARGON_DRUCK_ST"     
## [16] "VG"                  "TM_LS_SSL"           "TU_SSL_LS"          
## [19] "STRANGBREITE"        "TU_FS_M"             "TM_SSL_FS"          
## [22] "TU_SSR_LS"           "KONI_LINKS"          "TO_LS_M"            
## [25] "TO_LS_SSR"           "DT_SSL"              "TM_LS_M"            
## [28] "TO_FS_SSL"           "WASSER_SSL"          "DT_FS"              
## [31] "DT_LS"               "TO_LS_SSL"           "TU_LS_M"

2.9.3 Class 15 Conclusion

Again, the log transformation of the class 15 errors was not useful for dealing with the poor spread of the data, so only the binary response was modeled for this class.

The conditional tree was observed to be highly volatile under varying seeds, again due to the low rate of Class 15 errors and the relatively high number of variables in the data set. Still, it marked 4 different variables (one of which was used at two separate nodes) as being significant for selecting Class 15 errors. Node 9 of this tree caught the majority of the errors in the data set: of the approximately 12.9% of the observations in the training set which were errors, node 9 contains 7.4%.

To deal with the large amount of volatility seen in the tree model, 3 separate random forest models were grown on the same training set, and their results were compared. Of the top 20 variables marked as most important in each of the three models, those that were shared by all three are listed below in order of increasing average importance.

C15Top20  
## [1] "TO_SSR_LS"    "TM_FS_M"      "TU_SSL_LS"    "TO_SSL_FS"    "TM_FS_SSR"   
## [6] "KONI_LINKS"   "STRANGBREITE" "TO_FS_M"      "TU_FS_SSR"

The variables which were shared by all three models in the list of top 40 most important variables follow below.

C15Top40  
##  [1] "TM_LS_SSR"        "TO_FS_SSR"        "PLATTENDICKE_SSR" "TU_LS_SSL"       
##  [5] "TU_LS_SSR"        "TO_LS_SSL"        "DT_LS"            "TO_LS_SSR"       
##  [9] "TM_SSL_LS"        "TU_FS_M"          "DICKE__VB__IR__S" "TO_LS_M"         
## [13] "TU_SSR_LS"        "TO_SSR_LS"        "TU_SSL_FS"        "TM_FS_M"         
## [17] "TU_SSL_LS"        "TM_SSR_LS"        "TO_SSL_FS"        "TM_FS_SSR"       
## [21] "KONI_LINKS"       "STRANGBREITE"     "TO_FS_M"          "TU_FS_SSR"

Finally, the variables shared by all three error class model sets as being in the top 20 most important variables are listed in increasing order of importance, as follows.

AllTop20
## [1] "STRANGBREITE"

2.10 References

  • Hothorn, Torsten, et al. “Ctree: Conditional Inference Trees.” Ctree: Conditional Inference Trees, CRAN, cran.r-project.org/web/packages/partykit/vignettes/ctree.pdf.

  • Jackson, Simon. “Exploring Correlations in R with Corrr.” BlogR on Svbtle, 21 Aug. 2018, drsimonj.svbtle.com/exploring-correlations-in-r-with-corrr.

  • Zhu, Hao. “Package ‘KableExtra.’” KableExtra.pdf, CRAN, 22 Jan. 2019, cran.r-project.org/web/packages/kableExtra/kableExtra.pdf.